ABSTRACT

This  paper  describes  an  algorithm  for  the  conversion  of  a 
grammar  in  the  form  of  a  set  of  BNF  productions  into  a  deterministic 
parsing  algorithm  as  described  by  a  set  of  modified  Floyd  productions. 
The  algorithm  is  extended  in  such  a  way  that  it  may  easily  become  a 
part  of  a  complete  translator  writing  system  and  make  use  of  the 
information  available  in  the  semantic  part  of  such  a  system.   The 
paper  also  includes  a  discussion  of  the  implementation  of  the  extended 
algorithm  and  describes  potential  related  research. 

1.   INTRODUCTION

The algorithm which converts Backus-Naur Form (BNF) productions
to Floyd Production Language (FPL) productions, and its
implementation, both described here, grew from the need for a
procedure-oriented language for the ILLIAC IV computer being designed
by and built for the University of Illinois. Since the ILLIAC IV
is  a  completely  new  concept  in  digital  computers,  the  details  of  a 
language  best  suited  to  it  and  to  its  applications  are  relatively 
unknown.   For  this  reason,  the  use  of  a  compiler-compiler,  or 
translator  writing  system  (TWS)   is  preferable  to  writing  a  specific 
hard-coded program for the translation of a fixed language. It is
anticipated that the use of a TWS will make the task of changing
the language easier for the language designers whenever the need
for such changes arises. It is also hoped that the availability
of a TWS will encourage users of the ILLIAC IV to modify
this  base  language  and  create  new  languages  more  nearly  suited  to 
specific  applications. 

The  goals  of  this  TWS  are  the  same  as  those  of  any  TWS; 
namely,  the  building  of  a  translator  which  is  fast,  occupies  as 
little  space  as  possible,  and  is  capable  of  generating  efficient 
code.  A  TWS  must  also  be  able  to  accept  the  grammars  of  a  large 
class  of  languages.   In  addition,  since  we  hope  to  encourage  the  use 
of  the  TWS  by  applications  people  who  are  often  not  familiar  with 
language specifications, it is important that the language specification
input to the TWS be in as simple a form as possible.


Consideration was given to several different parsing
algorithms as potential bases of the system. Among these were the
algorithms of McKeeman [1], Wirth and Weber [2], Ingerman [3],
Brooker and Morris [4], and Trout [5]. Without precluding subsequent
incorporation  of  some  of  the  other  methods,  it  was  decided  that  some 
version  of  the  production  language  (FPL)  of  Floyd  [6],  Evans  [7], 
and  Feldman  [8]  was  potentially  most  likely  to  meet  the  first  of 
the  above  criteria.   The  input  to  such  a  system  is,  however,  more 
complex  than  was  desirable. 

Chomsky  [9]  type  two  (context  free)  productions  were 
chosen  as  the  most  widely  known  and  easily  understood  method  of 
language  syntax  specification.  The  most  commonly  used  notation  for 
this  type  of  production  is  the  Backus-Naur  Form  (BNF).  Algorithms 
for conversion of BNF to FPL by Earley [10] and DeRemer [11] were
considered  and  that  of  DeRemer  chosen  as  the  more  promising. 

This  thesis  is  a  description  of  an  extension  of  that 
algorithm and the subalgorithms used in its implementation. Brief
discussions of the properties of both the algorithm and its
implementation are included where appropriate.


2.   THE  ALGORITHM 

The  algorithm  described  in  this  chapter  is  that  which 
was proposed by F. L. DeRemer [11] at the University of Illinois.
Subsequent  chapters  will  describe  the  extensions  necessary  for  its 
practical  implementation.   The  notation  used  here  will  also  be 
used  in  the  following  chapters. 

2.1  Notation 

We  consider  a  language  L  to  be  defined  by  a  phrase 
structure  grammar 


    $G = (V_T, V_N, S, P)$

where

    $V_T$ = the set of terminal symbols of L, which will be
            represented by lower case Latin letters,

    $V_N$ = a set of nonterminal symbols, which will be
            represented by upper case Latin letters,

    $S \in V_N$ is the objective symbol,

    $P$ = a numbered set of rules defining a language L
          specified by the user, together with
          $\Sigma \rightarrow \vdash S \dashv$.


Strings of symbols will be denoted by Greek letters. The rules
defining L will be Chomsky type two productions and will be written
in the form:

    $N \rightarrow \alpha$


Symbol $X$ ($X \in V = V_T \cup V_N$) is a head symbol of $N$ if

    $N = X$, or

    $N \rightarrow X \ldots$, or

there exists a subset of $P$ of the form

    $N \rightarrow N_1 \ldots$
    $N_1 \rightarrow N_2 \ldots$
    $\vdots$
    $N_n \rightarrow X \ldots \qquad n \geq 1$
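
The head symbol relation is the reflexive, transitive closure of "is the
first symbol of a right hand side". A minimal sketch of that closure
follows. The representation of productions as (left side, right side)
pairs, the function name `head_symbols`, and the sample grammar are all
assumptions made for illustration and are not part of the original report.

```python
def head_symbols(productions, n):
    """Return the set of head symbols of symbol n (illustrative sketch)."""
    heads = {n}                      # N = X: every symbol is its own head symbol
    pending = [n]
    while pending:
        current = pending.pop()
        for lhs, rhs in productions:
            # N -> X ... directly, or through a chain N -> N1 ..., N1 -> N2 ...
            if lhs == current and rhs and rhs[0] not in heads:
                heads.add(rhs[0])
                pending.append(rhs[0])
    return heads


# Hypothetical sample grammar; production numbers are the list indices.
GRAMMAR = [
    ("SIGMA", ("|-", "S", "-|")),    # production 0: the added special production
    ("S", ("S", "+", "T")),
    ("S", ("T",)),
    ("T", ("a",)),
    ("T", ("(", "S", ")")),
]
```

With this grammar, the head symbols of S are S, T, a and the left
parenthesis, since S begins with T and T begins with either terminal.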


The  string  to  be  parsed  is  assumed  to  be  of  the  form 


    $\vdash \alpha \dashv$

where $\alpha \in L(G)$, the set of strings defined by $G$, or
$\alpha \in V_T^*$, the set of all strings over $V_T$.

The  Chomsky  type  two  productions  will  be  referred  to 
simply  as  productions,  or  for  ease  of  notation,  BNF  productions. 


The  FPL  statements  will  be  referred  to  as  productions  (when  the 
context  makes  clear  which  type  of  production  is  meant)  or  FPL 
productions. 

An  FPL  production  will  be  written  in  the  form: 

    $L_1:\ \beta\,\alpha \mid \gamma \rightarrow N \mid *\ L_2$

The label $L_1$ may or may not be present. The string $\beta\alpha$ is
compared to the top of a stack. If $\beta\alpha$ is not identical to the
top of the stack, processing is transferred to the beginning of the next
production in sequence. If $\beta\alpha$ is identical to the top of the
stack, the terminal symbol string $\gamma$ is compared to the next
symbols in the input stream. The lack of a match is treated in the same
way as a failure to match the stack. Both $\beta$ and $\gamma$ may be
empty, as will be further explained later. The $\rightarrow N$ indicates
that $\alpha$ at the top of the stack is to be replaced by the single
symbol $N$. This field may be empty. The $*$ may or may not appear and,
if present, indicates that the next input symbol is to be scanned into
the top of the stack. $L_2$ is the label of the next production to be
used.
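
The way a single FPL production is tried against the stack and the input
can be sketched as follows. This is only an illustration of the fields
just described: the class name `FPLProduction`, its field names, and the
function `try_production` are hypothetical, and the sketch assumes that an
input symbol is available whenever a scan is requested.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FPLProduction:
    beta: Tuple[str, ...] = ()        # lookback symbols below alpha on the stack
    alpha: Tuple[str, ...] = ()       # stack comparison string
    gamma: Tuple[str, ...] = ()       # terminal lookahead string
    reduce_to: Optional[str] = None   # N in "-> N"; None if the field is empty
    scan: bool = False                # True if "*" is present
    next_label: Optional[str] = None  # label of the next group to be used


def try_production(p: FPLProduction, stack: List[str],
                   tokens: List[str], pos: int):
    """Apply p if it matches; return (new_pos, next_label), else None."""
    pattern = p.beta + p.alpha
    top = tuple(stack[len(stack) - len(pattern):]) if pattern else ()
    if len(pattern) > len(stack) or top != pattern:
        return None                   # stack mismatch: go on to next production
    if tuple(tokens[pos:pos + len(p.gamma)]) != p.gamma:
        return None                   # lookahead mismatch, treated the same way
    if p.reduce_to is not None:
        del stack[len(stack) - len(p.alpha):]   # replace alpha by N
        stack.append(p.reduce_to)
    if p.scan:
        stack.append(tokens[pos])     # scan the next input symbol
        pos += 1
    return pos, p.next_label
```

A driver would try the productions of a group in sequence until one of
them returns a result, then continue with the group named by the returned
label.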

A descriptor $(\pi, n)$ associates an FPL production with a
BNF production in the sense that the $\alpha$ in the FPL production
corresponds to the first $n \geq 1$ symbols on the right hand side (RHS)
of BNF production $\pi$.


2.2   Label Determination

The  FPL  productions  are  grouped  and  a  label  is  attached  to 
the  first  production  of  each  group.  The  labels  are  thus  associated 
with  an  entire  group  rather  than  a  single  production. 

The  first  step  of  the  algorithm  is  to  determine  the  labels 
of all the FPL groups. Let $X_n(\pi)$ be the n-th symbol on the RHS of
BNF production $\pi$. There are three rules for determining which
labels (groups) must be created.

For each $N \in V_N$:                                          2.2.1
    label $Nh$ exists if $\exists\, \pi, n : N = X_n(\pi),\ n > 1$

For each $N \in V_N$:                                          2.2.2
    label $Nt$ exists if $\exists\, \pi, n : N = X_n(\pi),\ n > 0$

For each $t \in V_T$:                                          2.2.3
    label $t(\pi, n)$ exists for each $\pi, n : t = X_n(\pi),\ n > 1$
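
The three label rules amount to a single pass over the right hand sides.
The following sketch shows one way to write that pass; the function name
`group_labels`, the string spelling of the labels, and the sample grammar
are assumptions made for illustration only.

```python
def group_labels(productions):
    """Collect the FPL group labels required by rules 2.2.1 to 2.2.3."""
    nonterminals = {lhs for lhs, _ in productions}
    labels = set()
    for pi, (_, rhs) in enumerate(productions):
        for n, symbol in enumerate(rhs, start=1):
            if symbol in nonterminals:
                labels.add(f"{symbol}t")            # rule 2.2.2 (n > 0)
                if n > 1:
                    labels.add(f"{symbol}h")        # rule 2.2.1 (n > 1)
            elif n > 1:
                labels.add(f"{symbol}({pi},{n})")   # rule 2.2.3 (n > 1)
    return labels


# Hypothetical sample grammar; production numbers are the list indices.
GRAMMAR = [
    ("SIGMA", ("|-", "S", "-|")),
    ("S", ("S", "+", "T")),
    ("S", ("T",)),
    ("T", ("a",)),
    ("T", ("(", "S", ")")),
]
```

Terminals in the first position of a right hand side produce no label of
their own, because they are handled inside the Nh groups.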


2.3  Descriptor  Set  Generation 

The  next  step  is  to  determine  the  set  of  FPL  descriptors, 
$D = \{(\pi_i, n_i) \mid i = 1, 2, \ldots, k\}$, for each FPL group which
must exist. The group labels are descriptive in the sense that the three
types of labels generate descriptor sets in the following three
different ways.

    $D_{Nh} = \{(\pi, 1) \mid X_1(\pi) \in V_T \text{ and the left side of } \pi \text{ is a head symbol of } N\}$     2.3.1

    $D_{Nt} = \{(\pi, n) \mid N = X_n(\pi),\ n > 0, \text{ all } \pi\}$     2.3.2

    $D_{t(\pi, n)} = \{(\pi, n)\}$     2.3.3
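
Rules 2.3.1 to 2.3.3 can be sketched as below. The names
`descriptor_sets` and `head_symbols`, the label strings, and the sample
grammar are hypothetical and only serve to make the definitions concrete.

```python
def head_symbols(productions, n):
    """Reflexive, transitive closure of the first-symbol relation."""
    heads, pending = {n}, [n]
    while pending:
        current = pending.pop()
        for lhs, rhs in productions:
            if lhs == current and rhs and rhs[0] not in heads:
                heads.add(rhs[0])
                pending.append(rhs[0])
    return heads


def descriptor_sets(productions):
    """Map each group label to its set of descriptors (pi, n)."""
    nonterminals = {lhs for lhs, _ in productions}
    sets = {}
    for pi, (_, rhs) in enumerate(productions):
        for n, symbol in enumerate(rhs, start=1):
            if symbol in nonterminals:
                sets.setdefault(f"{symbol}t", set()).add((pi, n))   # rule 2.3.2
                if n > 1:
                    heads = head_symbols(productions, symbol)
                    sets[f"{symbol}h"] = {                          # rule 2.3.1
                        (q, 1) for q, (lhs, r) in enumerate(productions)
                        if r and r[0] not in nonterminals and lhs in heads
                    }
            elif n > 1:
                sets[f"{symbol}({pi},{n})"] = {(pi, n)}             # rule 2.3.3
    return sets


# Hypothetical sample grammar; production numbers are the list indices.
GRAMMAR = [
    ("SIGMA", ("|-", "S", "-|")),
    ("S", ("S", "+", "T")),
    ("S", ("T",)),
    ("T", ("a",)),
    ("T", ("(", "S", ")")),
]
```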

2.4  Descriptor  to  FPL  Production  Mapping 

For  each  group,  the  descriptors  are  then  mapped  into 
FPL productions. If $\alpha$ is a string of length $n \geq 1$, the three
mapping rules are:

If production $\pi$ is $M \rightarrow \alpha N \ldots$                      2.4.1
then $(\pi, n)$ maps into:
    $\alpha \mid\ \mid *\ Nh$

If production $\pi$ is $M \rightarrow \alpha t \ldots$                      2.4.2
then $(\pi, n)$ maps into
    $\alpha \mid\ \mid *\ t(\pi, n + 1)$

If production $\pi$ is $M \rightarrow \alpha$                               2.4.3
then $(\pi, n)$ maps into
    $\alpha \mid \rightarrow M \mid Mt$
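
The three mapping rules depend only on what follows the first n symbols of
the production. A minimal sketch, in which the function name
`map_descriptor`, the textual rendering of an FPL production, and the
sample grammar are assumptions made for illustration:

```python
def map_descriptor(productions, pi, n):
    """Render the FPL production for descriptor (pi, n) as a string."""
    nonterminals = {lhs for lhs, _ in productions}
    lhs, rhs = productions[pi]
    alpha = " ".join(rhs[:n])
    if n == len(rhs):                              # rule 2.4.3: M -> alpha
        return f"{alpha} | -> {lhs} | {lhs}t"
    following = rhs[n]                             # symbol in position n + 1
    if following in nonterminals:                  # rule 2.4.1: M -> alpha N ...
        return f"{alpha} | | * {following}h"
    return f"{alpha} | | * {following}({pi},{n + 1})"   # rule 2.4.2


# Hypothetical sample grammar; production numbers are the list indices.
GRAMMAR = [
    ("SIGMA", ("|-", "S", "-|")),
    ("S", ("S", "+", "T")),
    ("S", ("T",)),
    ("T", ("a",)),
    ("T", ("(", "S", ")")),
]
```

For production 1 of the sample grammar, the descriptors (1, 1), (1, 2) and
(1, 3) exercise rules 2.4.2, 2.4.1 and 2.4.3 in turn.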


In addition, two special groups must be generated. They are:

    START: $\Sigma h$:      $\mid\ \mid *\ Sh$

    $\dashv (0, 3)$:        $\vdash S \dashv \mid \rightarrow \Sigma \mid$  Success Exit


2.5  Preclusion  and  Error  Production 

If, within any group, the stack comparison string $\alpha_1$ of
one FPL production is such that it may preclude the application of
a production with stack comparison string $\alpha_2$ which follows it
(if $\alpha_1 = \delta\alpha_2$ or $\alpha_2 = \delta\alpha_1$), a
contextual analysis must be done on both productions. That is, the
strings $\beta$ of symbols which may precede $\alpha$ and the strings
$\gamma$ of terminal symbols which may follow $\alpha$ in the particular
context (as defined by the descriptor) must be generated to sufficient
length for each of the productions so that $\beta\alpha\gamma$
allows only one production of a group to apply for any entry to that
group. An error production is added at the end of each group.


2.6  Summary 

The  algorithm  thus  consists  of  the  following  five  steps: 

1.  Read the BNF grammar of L and add the special
    production $\Sigma \rightarrow \vdash S \dashv$;

2.  Determine the labels of the FPL groups required;

3.  Generate descriptor sets for each group;

4.  Map the descriptors into FPL productions and create
    the two special groups, $\Sigma h$ and $\dashv (0, 3)$;

5.  Do the necessary contextual analysis to remove
    preclusions and add the FPL error production to
    each group.

3.   CONTEXTUAL ANALYSIS

The  algorithm  described  in  chapter  two  is  complete  but 
cannot  be  the  basis  for  a  TWS  parsing  algorithm  unless  a  practical 
method  of  contextual  analysis  is  available.   Initial  attempts  to 
actually generate lookback and lookahead symbol strings of sufficient
length to eliminate preclusion of one FPL production by
another  proved  impractical.   The  generation  of  the  lookback  symbol 
string,  however,  is  relatively  redundant  in  the  sense  that  it  is 
implicit  in  the  grouping.   This  chapter  introduces  an  extension  of 
the  algorithm  which,  in  most  cases,  eliminates  preclusion  of  one 
production  by  another  when  the  stack  comparison  strings  of  the  two 
are  of  different  lengths.   This  extension  also  allows  a  subgrouping 
of  the  Nt  groups  which  introduces  more  implicit  lookback. 

The  use  of  the  terminal  symbol  lookahead  string  remains 
as  described  in  chapter  two.   The  generation  of  this  string  will  be 
described  in  a  later  chapter.   It  is  important  to  note,  however, 
that  the  generation  of  these  lookahead  strings  is  the  most  time  and 
space  consuming  part  of  an  implementation  of  this  algorithm.  Even 
more  important  is  that  the  translator  generated  by  a  TWS  based  on 
this  algorithm  will  be  slowed  down  by  excessive  use  of  lookahead 
since  all  possible  lookahead  strings  must  be  compared  with  the 
input  stream  if  any  lookahead  is  needed.   In  this  way  the  single 
FPL  production 


    $\alpha \mid\ \mid \ldots$

might become

    $\alpha \mid \gamma_1 \mid \ldots$
    $\alpha \mid \gamma_2 \mid \ldots$
    $\vdots$
    $\alpha \mid \gamma_m \mid \ldots$

For  these  reasons  it  is  important  to  limit  the  need  for  lookahead 
as  much  as  possible. 

3.1  Bar  Symbols 

The  extension  mentioned  above  is  the  insertion  of  a  special 
symbol into the stack to mark the beginning of the search for a
particular nonterminal symbol. It is called a bar symbol and is
denoted by $\bar{N}$. Production mapping rule 2.4.1 is therefore modified
to:

If production $\pi$ is $M \rightarrow \alpha N \ldots$                      3.1.1
then $(\pi, n)$ maps into
    $\alpha \mid \leftarrow \bar{N} \mid *\ Nh$

where $\leftarrow \bar{N}$ means push $\bar{N}$ onto the stack.

The use of the bar symbol makes the lookback string $\beta$ the
same for all FPL productions, that is, one symbol which may be any
bar symbol. As such it will not be noted in FPL productions from
this point on. This device differentiates in all but one special
case (associated with the removal of a bar symbol) between two
productions which have stack comparison strings of different lengths.

3.2   Bar Symbol Removal

The bar symbol $\bar{N}$ is inserted in the stack when it is known
that nonterminal symbol $N$ is a subgoal of the parse, and should,
therefore, be removed when $N$ has been recognized. It is implicit
that this has just happened upon entry to the Nt group. If, however,
$N$ is explicitly left recursive (ELR), that is, there exist ELR BNF
productions $\pi_i$ of the form

    $N \rightarrow N\alpha$

then the reduction is not complete until the corresponding FPL
productions (i.e., those with descriptors $(\pi_i, 1)$) have been
processed. This type of FPL production will be referred to as ELR.

The Nt groups thus must be semi-ordered. The ELR productions
must come first, followed by a bar symbol removal production,
and then the rest of the productions of the group. The bar symbol
removal production is of the form

    $\bar{N} N \mid \rightarrow N \mid$

In this case $\alpha = \bar{N}N$, $\beta$ is empty (not any bar symbol),
and the lack of a label for transfer indicates that the next production
in sequence should be processed.

All productions of an Nt group have N as the last symbol
of  the  stack  comparison  string.  All  of  the  ELR  productions  have 
stack  comparison  strings  which  are  precisely  the  single  symbol  N. 
The  bar  symbol,  which  would  otherwise  prevent  the  preclusion  of 
the  longer  strings,  may  be  removed  between  the  processing  of  the 
ELR  productions  and  the  processing  of  the  non-ELR  productions. 
Therefore,  the  ELR  productions  will  preclude  all  other  productions 
of  the  group.   Lookahead  strings  must  be  generated  for  the  ELR 
productions  to  eliminate  this  preclusion. 

3.3  Nta and Ntb Groups

The non-ELR productions of an Nt group fall into two
classes: those with descriptors $(\pi, 1)$, called class A; and those
with descriptors $(\pi, n)$ with $n > 1$, called class B. Just prior to
a transfer to the Nt group the stack has $N$ in the top position and
a bar symbol in the second position. If the bar symbol is $\bar{N}$, it
has been put into the stack by an FPL production of the type:

    $\alpha \mid \leftarrow \bar{N} \mid *\ Nh$

which by mapping rule 3.1.1 comes from the BNF production $\pi_k$ of the
form

    $M \rightarrow \alpha N \ldots$

Therefore, the production in the Nt group which must apply (after
ELR productions have been processed and the bar symbol removed) has
the descriptor $(\pi_k, n + 1)$ where $n$ is the length of the non-empty
string $\alpha$. This production is in class B. If, on the other hand,
the bar symbol is not $\bar{N}$, then the FPL production which must apply
(after ELR processing) comes from a BNF production $\pi$ of the form:

    $M \rightarrow N\alpha$

and has the descriptor $(\pi, 1)$. This is a class A production. Thus,
the  bar  symbol  in  the  second  position  of  the  stack  can  be  used  as 
the  parameter  for  a  dynamic  transfer  to  either  a  group  labelled  Nta 
which  consists  of  the  ELR  productions  and  the  class  A  productions, 
or  a  group  labelled  Ntb  which  consists  of  the  ELR  productions,  the 
bar  symbol  removal  production,  and  the  class  B  productions. 

Since  the  bar  symbol  removal  production  cannot  apply  in 
the  Nta  group,  it  need  not  be  included  and  ordering  is  unnecessary. 
The  Ntb  group  is  identical  to  the  Nt  group  in  every  way,  except 
that  the  class  A  productions  have  been  deleted.   It  is  possible  for 
only  class  A  or  class  B  non-ELR  productions  to  exist,  in  which  case 
only  Nta  or  Ntb  groups,  respectively,  need  be  built. 
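
The division of an Nt group into its ELR, class A and class B parts can be
sketched as follows. The function name `split_nt_group`, the marker string
"remove-bar" standing for the bar symbol removal production, and the
sample grammar are hypothetical; a group that need not be built is
returned as None.

```python
def split_nt_group(productions, symbol):
    """Return (Nta, Ntb) descriptor lists for nonterminal `symbol`."""
    elr, class_a, class_b = [], [], []
    for pi, (lhs, rhs) in enumerate(productions):
        for n, x in enumerate(rhs, start=1):
            if x != symbol:
                continue
            if n > 1:
                class_b.append((pi, n))      # descriptor (pi, n), n > 1
            elif lhs == symbol:
                elr.append((pi, 1))          # N -> N alpha
            else:
                class_a.append((pi, 1))      # M -> N alpha, M different from N
    # Nta: ELR productions and class A productions, no ordering required.
    nta = elr + class_a if class_a else None
    # Ntb: ELR productions, then bar symbol removal, then class B productions.
    ntb = elr + ["remove-bar"] + class_b if class_b else None
    return nta, ntb


# Hypothetical sample grammar; production numbers are the list indices.
GRAMMAR = [
    ("SIGMA", ("|-", "S", "-|")),
    ("S", ("S", "+", "T")),
    ("S", ("T",)),
    ("T", ("a",)),
    ("T", ("(", "S", ")")),
]
```

In the sample grammar S has no class A production, so only an Stb group
results, while T yields both a Tta and a Ttb group.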

3.4  $Ntb(\pi, n)$ Groups

The Ntb groups can be further split. This is done by
associating a descriptor with each bar symbol as it is pushed into
the stack. If the FPL production which pushes $\bar{N}$ into the stack
has the descriptor $(\pi, n)$, then the descriptor for the non-ELR Ntb
group production which must eventually apply is $(\pi, n + 1)$.
Therefore, a separate group, labelled $Ntb(\pi, n)$, is built for each
class B production. An $Ntb(\pi, n)$ group contains the Nt ELR
productions, followed by the bar symbol removal production, and the
single production with descriptor $(\pi, n)$.

The splitting of the Nt group into Nta and $Ntb(\pi, n)$
groups completely eliminates the need for lookahead to prevent the
preclusion of one class B production by another. Even more important
is that if an ELR production needs several multisymbol lookahead
strings  (i.e.,  3,  Qb,  ac,  Qd,  .  .  . )  to  prevent  it  from  precluding  a 
particular  non-ELR  production  but  needs  fewer,  shorter  lookahead 
strings  (i.e.,  p,  a)   to  prevent  preclusion  of  all  other  ELR  and 
non-ELR  productions,  then  the  larger  set  of  lookaheads  is  applied 
only  when  necessary. 

This  splitting  of  the  Nt  groups  and  the  associated  dynamic 
transfers  make  it  necessary  to  change  the  special  START  group  to 
make  it  consistent.   It  becomes: 

    START: $\Sigma h$:     $\vdash \mid \leftarrow \bar{S}(0, 2) \mid *\ Sh$

where the special BNF production $\Sigma \rightarrow \vdash S \dashv$ is
assumed to be number zero.

3.5  Summary

With  the  extensions  described  above  the  various  rules  given 
in  chapter  two  are  changed  to  the  following: 

a)  Label  Determination: 


For each $N \in V_N$:                                          3.5.1
    label $Nh$ exists if $\exists\, \pi, n : N = X_n(\pi),\ n > 1$

For each $N \in V_N$:                                          3.5.2
    label $Nta$ exists if $\exists\, \pi : N = X_1(\pi)$, $\pi$ not ELR

For each $N \in V_N$:                                          3.5.3
    label $Ntb(\pi, n)$ exists for each $\pi, n : N = X_n(\pi),\ n > 1$

For each $t \in V_T$:                                          3.5.4
    label $t(\pi, n)$ exists for each $\pi, n : t = X_n(\pi),\ n > 1$

b)  Descriptor Set Generation:

    $D_{Nh} = \{(\pi, 1) \mid X_1(\pi) \in V_T \text{ and the LHS of production } \pi \text{ is a head symbol of } N\}$     3.5.5

    $D_{Nta} = \{(\pi, 1) \mid N = X_1(\pi)\}$     3.5.6

    $D_{Ntb(\pi, n)} = \{(\pi_i, 1) \mid N = X_1(\pi_i),\ \pi_i \text{ is ELR}\} \cup \{\text{bar symbol removal production}\} \cup \{(\pi, n)\}$  (with ordering)     3.5.7

    $D_{t(\pi, n)} = \{(\pi, n)\}$     3.5.8

c)  Mapping Descriptors to FPL Productions:
    ($\alpha$ of length $n \geq 1$)

If production $\pi$ is $M \rightarrow \alpha N \ldots$                      3.5.9
then $(\pi, n)$ maps into
    $\alpha \mid \leftarrow \bar{N}(\pi, n + 1) \mid *\ Nh$

If production $\pi$ is $M \rightarrow \alpha t \ldots$                      3.5.10
then $(\pi, n)$ maps into
    $\alpha \mid\ \mid *\ t(\pi, n + 1)$

If production $\pi$ is $M \rightarrow \alpha$                               3.5.11
then $(\pi, n)$ maps into
    $\alpha \mid \rightarrow M \mid DMt$

where, if $\bar{K}(\pi, n)$ is the bar symbol at stack level two, then

    $K = M \Rightarrow DMt = Mtb(\pi, n)$

    $K \neq M \Rightarrow DMt = Mta$

4.   COMBINED GROUPS

A  further  extension  of  the  algorithm,  which  makes  it 
possible  to  accept  a  larger  class  of  grammars,  is  the  combination 
of  two  or  more  FPL  productions  in  a  group  into  a  single  production. 
Two  productions  are  combined  (if  possible)  when  the  first  precludes 
the second and the finite lookahead is not sufficient to differentiate
between the two. Each production which is formed by such a
combination  creates  the  need  for  one  or  more  additional  groups. 

4.1  Ct (m) Groups

The simplest combination possible is of two or more FPL
productions formed by mapping rule 3.5.10. If, within a group,
the productions

    $\alpha \mid\ \mid *\ t(\pi_1, n)$
    $\vdots$
    $\alpha \mid\ \mid *\ t(\pi_k, n)$

should appear, and if a differentiation by lookahead is not possible,
then they can be combined into the single production

    $\alpha \mid\ \mid *\ Ct(m)$                                   4.1.1

where this is the m-th such combination to have been made.
Associated with the integer m is the set of labels

    $\mathcal{L}_{Ct(m)} = \{t(\pi_i, n) \mid i = 1, 2, \ldots, k\}$     4.1.2

The descriptor set of Ct (m) is defined by:

    $D_{Ct(m)} = \bigcup_{q \in \mathcal{L}_{Ct(m)}} D_q$               4.1.3

Group Ct (m) is then precisely the union of the various groups
which could have been transferred to had the combination not been
made.

The Ct (m) group generated above is:

    Ct (m):   $\alpha t \mid \ldots$
              $\vdots$
              $\alpha t \mid \ldots$
              error

As is easily seen, the preclusion has not been eliminated, but simply
moved to a different group. This group, however, has one more symbol
in the stack comparison field and therefore the maximum length
lookahead reaches one symbol further into the input. The net
effect of a combination of this type is the implicit one symbol
extension of the maximum lookahead capability of an implementation
of the algorithm.
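
Rules 4.1.2 and 4.1.3 describe a plain union of descriptor sets. A small
sketch follows; the function name `combine_ct`, the string form of the
labels, and the dictionary of descriptor sets are assumptions made for
illustration.

```python
def combine_ct(m, labels, descriptor_sets):
    """Build the m-th combined group from the labels of the merged productions."""
    label_set = frozenset(labels)                 # rule 4.1.2
    combined = set()
    for q in label_set:
        combined |= descriptor_sets[q]            # rule 4.1.3: union of the D_q
    return f"Ct({m})", label_set, combined


# Hypothetical descriptor sets of three t (pi, n) groups.
SETS = {"t(1,2)": {(1, 2)}, "t(5,2)": {(5, 2)}, "t(7,3)": {(7, 3)}}
name, label_set, descriptors = combine_ct(1, ["t(1,2)", "t(5,2)"], SETS)
```

Here `descriptors` holds (1, 2) and (5, 2), that is, the productions of
both groups which the two original transfers could have reached.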

4.2  ch (m) Groups

Another possible combination is of productions formed by
mapping rule 3.5.9 which cause bar symbols of different nonterminals
to be pushed into the stack. This cannot be done under a certain
condition, but it will be shown later in this chapter how this
condition can be eliminated.

If, within a group, the productions

    $\alpha \mid \leftarrow \bar{N}_1(\pi_1, n) \mid *\ N_1h$
    $\vdots$
    $\alpha \mid \leftarrow \bar{N}_k(\pi_k, n) \mid *\ N_kh$

should appear, and if a differentiation by lookahead is not
possible, then they can be combined into the single production

    $\alpha \mid \leftarrow (m) \mid *\ ch(m)$                     4.2.1

where this is the m-th such combination to have been made. The symbol
(m) is called a metabar symbol and represents the set of bar symbols
which could have been pushed into the stack by the various
productions which were combined. That is:

    $(m) = \{\bar{N}_1(\pi_1, n), \ldots, \bar{N}_k(\pi_k, n)\}$   4.2.2

The sets represented by the metabar symbols must be available
when the FPL productions are used to parse an input string.
There are two reasons for this. The bar symbol removal production

    $\bar{N} N \mid \rightarrow N \mid$

must apply when stack level one contains $N$ and stack level two
contains $\bar{N}$, or stack level two is $(m)$ and $\bar{N} \in (m)$.
The second use of these sets is in applying the dynamic transfer, DMt,
defined in 3.5.11. The definition of DMt remains the same if
$\bar{K}(\pi, n)$ is the bar symbol at stack level two. However, if the
symbol at stack level two is $(m)$, then DMt is defined as follows:

    $\bar{M}(\pi, n) \in (m) \Rightarrow DMt = Mtb(\pi, n)$        4.2.3

    $\bar{M}(\pi, n) \notin (m) \Rightarrow DMt = Mta$

The metabar symbol (m) also satisfies the implicit one symbol
lookback for any bar symbol.

The production formed by a combination of this type indicates
a transfer to label ch(m). This type of group is necessary whenever such
a combination is made. Its descriptor set is indirectly defined by (m)
as follows:

    $D_{ch(m)} = \bigcup_{\bar{N} \in (m)} D_{Nh}$                 4.2.4

Again the new group to be built is exactly the union of all the
groups which could have been transferred to by the productions
which were combined.

4.3  CNtb (m) Groups

A third type of combination is of productions formed by
mapping rule 3.5.9 where the bar symbols to be pushed into the
stack are all for the same nonterminal symbol. If, within a group,
the productions

    $\alpha \mid \leftarrow \bar{N}(\pi_1, n) \mid *\ Nh$
    $\vdots$
    $\alpha \mid \leftarrow \bar{N}(\pi_k, n) \mid *\ Nh$

should appear, and if a differentiation by lookahead is not possible,
then they can be combined into the single production

    $\alpha \mid \leftarrow \bar{N} * (m) \mid *\ Nh$              4.3.1

where this is the m-th such combination to have been made. The
special bar symbol $\bar{N} * (m)$ is the same as $\bar{N}$ in terms of
implicit lookback and bar symbol removal. It does, however, call for a
further definition of the dynamic transfer, DMt, of 3.5.11 and
4.2.3. This transfer remains the same under the previously defined
stack level two conditions, but is extended when stack level two is
$\bar{K} * (m)$ as follows:

    $K = M \Rightarrow DMt = CMtb(m)$                              4.3.2

    $K \neq M \Rightarrow DMt = Mta$

The label of a group which is built whenever a combination
of this type is made is CMtb (m). Associated with the integer m is
the set of labels:

    $\mathcal{L}_{CMtb(m)} = \{Mtb(\pi_i, n) \mid i = 1, 2, \ldots, k\}$     4.3.3

The descriptor set for this new group is then defined by

    $D_{CMtb(m)} = \bigcup_{q \in \mathcal{L}_{CMtb(m)}} D_q$      4.3.4

In this case the productions which were combined each pushed into
the stack a bar symbol which defined a dynamic transfer to an
$Mtb(\pi, n)$ group. This new group which must be formed is the
union of those $Mtb(\pi, n)$ groups. It must include the bar symbol
removal production and be semi-ordered in the same way as an
$Mtb(\pi, n)$ group.

4.4  $\bar{N} * (m)$ as an Element of (k)

An extension of the devices of sections 4.2 and 4.3
allows the combination of any set of productions formed by mapping
rule 3.5.9. If the bar symbols to be pushed into the stack by the
productions to be combined are mixed, in the sense that some are of
different nonterminal symbols and some are of the same nonterminal
symbols, then the procedures of the previous two sections do not
apply.

In this case the procedure of section 4.3 is applied to
whichever sets of productions qualify, and the procedure of section
4.2 is applied to the resulting set of productions. Two of the
rules of section 4.2 must be amended in this case because metabar
symbol (k) now defines a set which contains both bar symbols,
$\bar{N}(\pi, n)$, and special bar symbols, $\bar{N} * (m)$. First, the
part of the definition of DMt of 4.2.3 must be changed. It becomes:

If the symbol at stack level two is (k), then                   4.4.1

    $\bar{M}(\pi, n) \in (k) \Rightarrow DMt = Mtb(\pi, n)$

    $\bar{M} * (m) \in (k) \Rightarrow DMt = CMtb(m)$

    $\bar{M}(\pi, n) \notin (k),\ \bar{M} * (m) \notin (k) \Rightarrow DMt = Mta$

Rule 4.2.4 which defines the descriptor set for ch (k) must also be
changed to read as follows:

    $D_{ch(k)} = \bigcup_{\bar{N} \in (k) \text{ or } \bar{N} * \in (k)} D_{Nh}$     4.4.2


4.5  Initial Revision of BNF Grammar

Section 4.2 made reference to a condition which prevented
the combination of productions as described in that section. This
occurs when the nonterminal symbols associated with the bar symbols
to be pushed into the stack by two of the productions are such that
one is a head symbol of the other. That is, when $\bar{N}(\pi_i, n)$ and
$\bar{M}(\pi_j, n)$ would be elements of $(m)$ and $N$ is a head symbol
of $M$. This will always cause a dynamic transfer to $Ntb(\pi_i, n)$ upon
recognition of $N$. This ignores completely the possibility that
this particular occurrence of $N$ is the beginning of a substring
which will eventually reduce to $M$ (in which case the transfer
should be to Nta).

The problem can be overcome by a simple revision of the
BNF grammar before any BNF production to FPL conversion is done.
This involves a search of all the BNF productions for pairs of the
form:

    $A \rightarrow \alpha N \ldots$

    $B \rightarrow \alpha X \ldots$

where $A$ may or may not be the same as $B$, $\alpha$ is a string of
length $n \geq 1$, $X \in V = V_T \cup V_N$, and $X$ is a head symbol of
$N$. This pair is then replaced by

    $A \rightarrow \alpha N \ldots$

    $B \rightarrow \alpha Q$          ($Q$ is a new nonterminal symbol.)

and the production

    $Q \rightarrow X$

is added to the set of BNF productions.
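
The revision can be sketched as a repeated search and replace. Several
points in this sketch are assumptions, not statements of the original
report: any symbols which follow X are kept in B's production after Q; a
pair is skipped when X is N itself or when N is also a head symbol of X;
the new nonterminals are named Q1, Q2, ...; and the function names and
sample productions are hypothetical.

```python
def head_symbols(productions, n):
    """Reflexive, transitive closure of the first-symbol relation."""
    heads, pending = {n}, [n]
    while pending:
        current = pending.pop()
        for lhs, rhs in productions:
            if lhs == current and rhs and rhs[0] not in heads:
                heads.add(rhs[0])
                pending.append(rhs[0])
    return heads


def find_conflict(prods, nonterminals):
    """Find a pair A -> alpha N ..., B -> alpha X ... with X a head symbol of N."""
    for i, (_, ra) in enumerate(prods):
        for j, (_, rb) in enumerate(prods):
            if i == j:
                continue
            for n in range(1, min(len(ra), len(rb))):
                if ra[:n] != rb[:n]:
                    break                          # common prefix alpha has ended
                big, x = ra[n], rb[n]
                if (big in nonterminals and x != big
                        and x in head_symbols(prods, big)
                        and big not in head_symbols(prods, x)):
                    return j, n
    return None


def revise_grammar(productions):
    """Return a revised production list in which no such pair remains."""
    prods = [(lhs, tuple(rhs)) for lhs, rhs in productions]
    nonterminals = {lhs for lhs, _ in prods}
    fresh = 0
    while True:
        hit = find_conflict(prods, nonterminals)
        if hit is None:
            return prods
        j, n = hit
        lhs, rhs = prods[j]
        fresh += 1
        q = f"Q{fresh}"                            # assumed unused in the grammar
        prods[j] = (lhs, rhs[:n] + (q,) + rhs[n + 1:])   # B -> alpha Q ...
        prods.append((q, (rhs[n],)))               # Q -> X
        nonterminals.add(q)


# Hypothetical productions: X is a head symbol of N, and A and B share "a".
SAMPLE = [("A", ("a", "N", "c")), ("B", ("a", "X")),
          ("N", ("X", "b")), ("X", ("x",))]
REVISED = revise_grammar(SAMPLE)
```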

This  revision  of  the  BNF  grammar  does  not  change  the 
language  being  specified  but  does  keep  the  previously  described 
condition from occurring and preventing the combination of
productions when necessary. Allowing X to be a terminal symbol in the
revision also makes certain that there will never be preclusion
involving FPL productions generated by both mapping rule 3.5.9
and mapping rule 3.5.10. Thus, the only case where combination
of productions cannot be used to overcome preclusion is when one
or more of the FPL productions involved has been generated by
mapping rule 3.5.11.

4.6  Summary

The total algorithm is now extended to include an initial
BNF grammar revision. In addition, preclusions which cannot be
resolved by lookahead and do not involve an FPL production which
is generated by mapping rule 3.5.11 may be eliminated by combining
productions and creating new groups. These combination procedures
use the new special symbols $(m)$ and $\bar{N} * (m)$. The definition of
the dynamic transfer label, DMt, is extended to:

If stack level two is $\bar{K}(\pi, n)$ then                        4.6.1

    $K = M \Rightarrow DMt = Mtb(\pi, n)$
    $K \neq M \Rightarrow DMt = Mta$

If stack level two is $(m)$ then                                    4.6.2

    $\bar{M}(\pi, n) \in (m) \Rightarrow DMt = Mtb(\pi, n)$
    $\bar{M} * (k) \in (m) \Rightarrow DMt = CMtb(k)$
    $\bar{M}(\pi, n) \notin (m)$ and $\bar{M} * (k) \notin (m) \Rightarrow DMt = Mta$

If stack level two is $\bar{K} * (m)$ then                          4.6.3

    $K = M \Rightarrow DMt = CMtb(m)$
    $K \neq M \Rightarrow DMt = Mta$
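
Rules 4.6.1 to 4.6.3 can be read as one function of the reduced
nonterminal M and the symbol at stack level two. In the sketch below the
tagged tuples ("bar", K, pi, n), ("special", K, m) and ("meta", members),
the function name `dynamic_transfer`, and the string form of the labels
are all hypothetical encodings. The sketch also assumes that a metabar set
holds at most one entry for any given nonterminal.

```python
def dynamic_transfer(m_sym, level_two):
    """Return the label DMt for nonterminal m_sym given stack level two."""
    kind = level_two[0]
    if kind == "bar":                                # rule 4.6.1
        _, k, pi, n = level_two
        return f"{m_sym}tb({pi},{n})" if k == m_sym else f"{m_sym}ta"
    if kind == "special":                            # rule 4.6.3
        _, k, m = level_two
        return f"C{m_sym}tb({m})" if k == m_sym else f"{m_sym}ta"
    if kind == "meta":                               # rule 4.6.2
        for member in level_two[1]:
            if member[1] == m_sym:                   # a bar symbol of M is in (m)
                return dynamic_transfer(m_sym, member)
        return f"{m_sym}ta"
    raise ValueError(f"unknown stack symbol kind: {kind!r}")


# Hypothetical metabar symbol holding one bar symbol and one special bar symbol.
META = ("meta", frozenset({("bar", "S", 4, 2), ("special", "T", 2)}))
```

For example, reducing to T with META at stack level two selects the
combined group CTtb(2), while reducing to a nonterminal which has no entry
in META selects its ta group.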

5.   SEMANTICS AND EMPTY BNF PRODUCTIONS

The  algorithm  described  in  chapters  one  through  four  has 
been  implemented  on  a  Burroughs  B5500  computer  at  the  University 
of  Illinois.  The  input  to  the  program  is  the  grammar  of  a  language 
in  the  form  of  a  set  of  modified  BNF  productions.   The  output  is  a 
stream  of  pseudo  orders  corresponding  to  the  FPL  productions  gen- 
erated, and  a  set  of  tables,  one  of  which  contains  the  definitions 
of  the  metabar  symbols. 

5.1  Semantic  Routines 

The BNF productions include two different applications of
semantics. The semantics are written separately as a set of
numbered routines which are accessible to the parser interpreter,
which causes the pseudo orders to be executed. The semantic
routines are associated with the BNF productions by placing the
appropriate number (preceded by #) anywhere on the RHS of a
production, except at the beginning. For example:

     1)  <A>  =  a <B> #1
     2)  <A>  =  a #2 <C>
     3)  <A>  =  a #3 <D> #1

would all be valid input productions. Since the total set of FPL
productions has at least one production for each possible
descriptor, the semantic routines can be executed precisely when
the parse reaches the point indicated in the BNF grammar. This
requires an extra field in the FPL productions which applies in
the same way for all of the mapping rules. The general FPL
production becomes

     L1:  ρα | γ → N | m * L2

where m is an integer referring to a semantic routine which is to
be called immediately upon recognition of ρ, α in the stack and γ
in the input stream, and before any stack manipulation is done. All
other fields of the production are as previously described. The
mapping rules are extended to include the setting of this semantic
action field when the descriptor being mapped is (π, n) and
# <integer> follows symbol n of production π.
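
The ordering constraint on the semantic action field (the routine
runs after the stack and input have matched, but before the stack
is changed) can be illustrated with a short Python sketch. The
record layout, the function name try_apply, and the simplified
reduction step (replacing the whole matched string by N) are
assumptions made for the example; the scan field and the handling
of the bar symbol are deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FplProduction:
    stack_pattern: List[str]      # rho followed by alpha; top of stack last
    lookahead: List[str]          # gamma
    reduce_to: Optional[str]      # N, or None when nothing is reduced
    semantic: Optional[int]       # m, or None when no routine is attached
    next_label: str


def try_apply(p: FplProduction,
              stack: List[str],
              inp: List[str],
              routines: Dict[int, Callable[[List[str]], None]]) -> Optional[str]:
    """Apply p if it matches; return the next label, or None on mismatch."""
    k = len(p.stack_pattern)
    if stack[len(stack) - k:] != p.stack_pattern:
        return None
    if inp[:len(p.lookahead)] != p.lookahead:
        return None
    if p.semantic is not None:
        routines[p.semantic](stack)       # runs before any stack manipulation
    if p.reduce_to is not None:
        del stack[len(stack) - k:]        # simplified reduction (assumption)
        stack.append(p.reduce_to)
    return p.next_label
```

A routine attached in this way sees the stack exactly as it was
when the production was recognized, which is what allows it to
refer to the symbols of the construct by stack level.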

Each stack position, in addition to containing a syntactic
symbol, also contains a field which the language designer may use
(through semantic routines) for information associated with that
particular syntactic symbol. The semantic routines are also used
for many other things, such as building identifier tables and
generating object or intermediate language code.

The use of semantic routines does place one restriction
on the algorithm. A set of FPL productions which should, according
to the procedures described in chapter four, be combined, may
not be combined if they have different semantic actions. This is
because the appropriate semantic routine must be executed before
continuing the parse. For example, the <A>h group, from the above
BNF productions, would be

     <A>h:   a |   ← B(1, 2) |   * Bh
             a |   ← C(1, 2) | 2 * Ch
             a |   ← D(1, 2) | 3 * Dh
             error

The preclusion involved here must be overcome by lookahead or not
at all. This restriction has little practical effect on the
capabilities of the system since, in actual use of the system, the
vast majority of semantic routine designations are at the end of
RHS's (i.e., semantic actions to be performed upon recognition of
complete constructs).

5.2  Semantic Conditions

A second use of semantic routines allows the language
designer to affect the parse by differentiating, with semantics,
between otherwise identical syntactic symbols. Semantic routines
used for this purpose are called semantic conditions and are
denoted in both BNF and FPL productions by <#m>, where m is an
integer and refers to semantic routine number m. These semantic
conditions are associated with a particular use of a symbol (either
terminal or nonterminal) by inserting them into the BNF productions
immediately following that symbol on the appropriate RHS. The
system considers the semantic condition as a permanent part of the
symbol which precedes it and, in fact, treats that symbol as being
syntactically different from the same symbol with a different
semantic condition, or with no semantic condition.

The following example illustrates the use of semantic
conditions. Assume the following productions are part of a BNF
grammar.

     <A> → a <#1>
     <A> → a <#2>
     <A> → a

The Ah group then would be

     Ah:   a <#1> |  → A | DAt
           a <#2> |  → A | DAt
           a      |  → A | DAt
           error

No preclusion exists since the three single symbol stack
comparisons are considered by the system as being different
symbols.

The semantic routines which are used as semantic conditions
may do anything that any other semantic routine may do. They must,
however, also set a fixed boolean variable in the skeleton program
either true or false. The symbol X <#m> in the stack comparison
string of an FPL production is processed by first comparing the
symbol X with the stack in the normal manner and then executing
semantic routine number m. If the value returned is true, then
the symbol is considered to have matched the stack and processing
of that production continues. If it is false, the stack comparison
has failed and processing is transferred to the beginning of
the next production in sequence.
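
The matching rule for a conditioned symbol can be sketched in a few
lines of Python. This is an illustration only: the names matches
and first_applicable are invented for the example, and the fixed
boolean variable of the skeleton program is modelled, as a
simplifying assumption, by the return value of each condition
routine.

```python
from typing import Callable, Dict, List, Optional, Tuple

# One stack comparison string: (symbol, condition number or None) pairs,
# with the top of the stack last.
Pattern = List[Tuple[str, Optional[int]]]


def matches(pattern: Pattern,
            stack: List[str],
            conditions: Dict[int, Callable[[], bool]]) -> bool:
    """Compare each symbol first, then run its semantic condition."""
    if len(pattern) > len(stack):
        return False
    top = stack[len(stack) - len(pattern):]
    for (symbol, cond), actual in zip(pattern, top):
        if symbol != actual:
            return False
        if cond is not None and not conditions[cond]():
            return False            # comparison fails; try next production
    return True


def first_applicable(group: List[Pattern],
                     stack: List[str],
                     conditions: Dict[int, Callable[[], bool]]) -> Optional[int]:
    """Index of the first production that matches, or None (error)."""
    for index, pattern in enumerate(group):
        if matches(pattern, stack, conditions):
            return index
    return None
```

Applied to the Ah group above, with condition 1 false and
condition 2 true, the second production is the one selected; with
both false, the unconditioned third production applies.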

The use of semantic conditions increases drastically the
capabilities of the system in that they can be used in conjunction
with normal semantic routines to refer back to every step of the
parse which has preceded their execution. They must be used with
great care, however, since

     a <#n>

is a preclusion which is not recognized by the system. In addition,

     <#n>
     <#m>

may also be a preclusion not recognized by the system if the user has
failed to effectively differentiate between the two cases.

5.3  Empty BNF Productions

An empty BNF production is

     N → λ

where λ is the null string. The use of the empty production is
unnecessary in specifying a language. It is, however, useful in
that it can make the specification of a grammar easier for the
language designer. Since this is one of the primary goals of this
system, it is allowed.

The algorithm, as described, does not handle the empty
production, so an initial pass is made which completely removes
all empty productions by a straightforward back substitution.
(It is possible to completely remove them because of the form of
the production E → ⊥ S ⊥.) This may affect semantic routines, as
in the following example:

     A → aBc #1
     B → λ
     B → b

which becomes

     A → aBc #1
     A → ac #1
     B → b

Semantic routine number one may refer to stack level three expecting
that it is the level of symbol a. Whenever such an effect is
possible (i.e., whenever the number of symbols preceding a semantic
routine call is reduced) a message is printed to that effect and
processing is continued with the assumption that the semantic routine
is not affected. It becomes the user's responsibility to rewrite the
semantic routine if necessary. This is not a serious problem since,
in most cases, the syntactic part of the language is written and
processed prior to the actual writing of the semantics.
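
The removal pass and its warning can be illustrated with the
following Python sketch. It uses the standard technique of finding
the nonterminals that can derive the null string and then emitting
every variant of each RHS with such symbols omitted; this is
assumed here to give the same result as the back substitution
described above, and is not claimed to be the implemented
procedure. The token convention (semantic routine calls written as
strings beginning with #) and the function names are assumptions
made for the example.

```python
from itertools import product
from typing import List, Tuple

Production = Tuple[str, List[str]]      # (left side, RHS tokens); [] is empty


def is_call(token: str) -> bool:
    """Semantic routine calls are written '#1', '#2', ... (assumption)."""
    return token.startswith("#")


def remove_empty(productions: List[Production]):
    """Return (productions without empty RHS's, productions needing a warning)."""
    nullable = set()
    changed = True
    while changed:                      # find every nonterminal deriving lambda
        changed = False
        for lhs, rhs in productions:
            symbols = [t for t in rhs if not is_call(t)]
            if lhs not in nullable and all(s in nullable for s in symbols):
                nullable.add(lhs)
                changed = True

    result, warnings = [], []
    for lhs, rhs in productions:
        slots = [i for i, t in enumerate(rhs) if t in nullable]
        for keep in product([True, False], repeat=len(slots)):
            dropped = {i for i, k in zip(slots, keep) if not k}
            new_rhs = [t for i, t in enumerate(rhs) if i not in dropped]
            if not [t for t in new_rhs if not is_call(t)]:
                continue                # never emit an empty production
            if (lhs, new_rhs) in result:
                continue
            result.append((lhs, new_rhs))
            # Fewer symbols now precede a later semantic routine call.
            if dropped and any(is_call(t) for t in rhs[min(dropped) + 1:]):
                warnings.append((lhs, new_rhs))
    return result, warnings
```

On the example above, the sketch produces the three productions
shown and flags A → ac #1, since routine one now finds symbol a at
stack level two rather than three.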
6.  IMPLEMENTATION

This chapter describes the more interesting features
of the implementation. Emphasis is given to those parts of
the algorithm 1) which this implementation may modify slightly,
or 2) for which the method of implementation is not relatively
obvious.

6.1  BNF  Production  Storage 

The BNF productions are first converted into a numerical
form and the RHS's are stored in a logical one dimensional array
called PROTAB. Each RHS symbol uses one word of the array, with an
extra word for semantic conditions. Each word is broken into
several fields which contain the following information:

     1)  a two part numerical representation of
         the symbol

     2)  which (if any) semantic routine call
         follows this symbol

     3)  the number of words to the beginning
         of the next RHS

     4)  whether this symbol is semantically
         conditioned

     5)  whether this is the leftmost symbol of
         an ELR production

     6)  the nonterminal symbol which is the left
         side of this production.

All  of  the  symbols  used  in  the  algorithm  are  separated 
into  the  following  types: 

nonterminal  symbols 
terminal  symbols 
bar  symbols 
metabar  symbols 
special  bar  symbols 
semantic  conditions 
identifiers 
numbers 
strings 

The semantic conditions are not actually symbols, but
the implementation, in many ways, treats them as if they are.
Identifiers, numbers, and strings are specified in the BNF grammar
by <*I>, <*N>, and <*S>, respectively. These are considered by the
system to be three different terminal symbols. These nine
different classes of symbols are each given a unique type number.

The second part of the numerical representation of the
first six classes of symbols is a number corresponding to a
particular symbol or semantic routine within that class. Identifiers,
numbers, and strings have no such entries. The scan procedure of the
skeleton program recognizes these as entities and uses the entry
field in the stack as a pointer to the address in the ECD symbol
table of the particular identifier, number, or string recognized.
This pointer then can be used by the semantic routines. Special
(or reserved) words of the language are specified in the syntax as
alphanumeric character strings (e.g., #BEGIN) and each is given a
separate entry number within the class of terminal symbols. One
of the tables output by the conversion program is a list of these
special words and their associated entry numbers for use by the
scan procedure.

The above described method of storing the BNF productions
allows the conversion routine almost immediate access to most of
the information it needs, while at the same time using as little
space as possible. During the course of this conversion certain
checks are made of the syntax. Among these are tests that all
nonterminal symbols (except Z) appear on both right and left sides
of BNF productions, and a check that there is at least one string of
terminal symbols which will reduce to each nonterminal symbol.

6.2  Grammar Revisions

During the course of the BNF production storage, the
nonterminal symbols which occur on the left side of empty
productions are stored. This information is used to eliminate all
empty productions in the manner described in section 5.3.

At this point, two dimensional boolean arrays which mark
the nonterminal head symbols (NTHS), the terminal head symbols (THS),
and the nonterminal tail symbols (NTTS, defined analogously to head
symbols) of each nonterminal are filled. For entry into the THS
array the special terminal symbols identifier, number, and string
are put at the end of the other terminal symbols. These tables
are initialized as follows:

     If ∃ a production M → N ...
          then NTHS[M, N] = true else false

     If ∃ a production M → t ...
          then THS[M, t] = true else false

     If ∃ a production M → ... N
          then NTTS[M, N] = true else false

where a symbol used as a subscript refers to the entry field for
that symbol (except identifier, number and string). The tables
are then filled by iteratively applying the principle that if X
is a head (tail) symbol of N and N is a head (tail) symbol of M,
then X is a head (tail) symbol of M. These tables are completed
by setting NTHS[N, N] and NTTS[N, N] equal to true for all
nonterminal symbols N.
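
The filling of the head symbol tables is a transitive closure, and
can be sketched as follows. The Python below is an illustration
under stated assumptions: sets keyed by nonterminal stand in for
the two dimensional boolean arrays, the function name head_tables
is invented, and every RHS is assumed to be non-empty (empty
productions having already been removed). The NTTS table would be
built the same way from the last symbol of each RHS.

```python
from typing import Dict, List, Set, Tuple

Production = Tuple[str, List[str]]      # (left side, RHS symbols)


def head_tables(productions: List[Production]):
    """Return (NTHS, THS) as dictionaries of sets, one entry per nonterminal."""
    nonterminals = {lhs for lhs, _ in productions}
    nths: Dict[str, Set[str]] = {n: set() for n in nonterminals}
    ths: Dict[str, Set[str]] = {n: set() for n in nonterminals}

    # Initialization: the first symbol of each RHS is a direct head symbol.
    for lhs, rhs in productions:
        first = rhs[0]
        if first in nonterminals:
            nths[lhs].add(first)
        else:
            ths[lhs].add(first)

    # Iterate: a head symbol of N is a head symbol of M whenever N is one.
    changed = True
    while changed:
        changed = False
        for m in nonterminals:
            for n in list(nths[m]):
                before = len(nths[m]) + len(ths[m])
                nths[m] |= nths[n]
                ths[m] |= ths[n]
                if len(nths[m]) + len(ths[m]) != before:
                    changed = True

    # Completion: every nonterminal is a head symbol of itself.
    for n in nonterminals:
        nths[n].add(n)
    return nths, ths
```

For a small expression grammar with E → E + T | T, T → T * F | F,
and F → ( E ) | i, the sketch marks E, T, and F as nonterminal
head symbols of E, and ( and i as terminal head symbols of all
three nonterminals.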

At this point the RHS's are compared in pairs and all
necessary grammar revisions are made as described in section 4.5.
The BNF grammar is now in a form acceptable to the conversion
algorithm and the NTHS and NTTS tables are available for use
when applicable.

Before conversion actually begins, a table of pointers
to each RHS occurrence of each nonterminal symbol, and a table of
pointers to RHS beginnings corresponding to the use of each
nonterminal symbol as a left side, are constructed. These are used
to eliminate excessive scanning of the entire array of RHS symbols.

6.3  Label Determination

Very little is necessary for label determination. All
nonterminal symbols require an Nt group which is separated into
Nta and Ntb(π, n) groups in a manner which will be further
discussed later. A single scan through the BNF production table
(PROTAB) is sufficient to determine which Nh groups are necessary.
No label determination is done for the t(π, n) groups. Instead,
when the conversion of an FPL production to pseudo orders reaches
a label for transfer to a t(π, n) group, the single FPL
production of that group is generated and the pseudo orders for it
are placed directly into the pseudo order stream. This saves
unnecessary transfers when the pseudo orders are processed. The last
production in a sequence of such generations is followed by the
error pseudo order.

As the FPL productions are generated in the Nta,
Ntb(π, n), and Nh groups, the need for Ct(π, n), ch(m), and
CNtb(m) groups arises. Whenever such a need occurs the name
of the group needed is simply inserted into a list, along with a
pointer to the information necessary to build the group. These
groups are created one at a time after the Nta, Ntb(π, n), and
Nh groups have been built. The generation of a combined group
may add to the list of combined groups which are needed.

6.4  Descriptor Set Generation

A descriptor will continue to be denoted by (π, n). It
is, however, actually a single number which points at a word in
PROTAB. The descriptor set for an Nt group is then precisely the
entries for N in the table of pointers to RHS occurrences.

The descriptor set for an Nh group is generated by using
information available in each word to step through the first symbol
of each RHS in PROTAB. If this first symbol X is a terminal symbol
and if the left side M of this production is such that NTHS[N, M] =
true, then the pointer to that first symbol is a descriptor in the
Nh group. The descriptor sets for the Nh groups are saved, even
after the groups have been built.

The descriptor denoted (π, n + 1) is (π, n) + 1 if the
symbol at (π, n) is not semantically conditioned. It is (π, n) + 2
if the symbol at (π, n) is semantically conditioned. The single
descriptor for the t(π, n) group is available when the need for
that group arises. The descriptor sets for Ct(π, n), ch(m), and
CNtb(m) groups are generated by a straightforward application of
4.1.3, 4.2.4, and 4.3.4. The label sets ℒ_Ct(m), N and ℒ_CNt(m), N
are the information associated with names, as discussed in
section 6.3.
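
Because a descriptor is just an index into PROTAB, both the Nh
descriptor set and the step from (π, n) to (π, n + 1) reduce to
simple index arithmetic. The Python sketch below illustrates this
under several assumptions: the Word record carries only the fields
needed here, symbols are plain strings rather than two part
numbers, and the function names are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Word:
    """One PROTAB word, reduced to the fields this sketch needs."""
    symbol: str
    lhs: str                    # left side of the production
    to_next_rhs: int            # words from here to the start of the next RHS
    conditioned: bool = False   # True if the next word is a semantic condition


def nh_descriptors(protab: List[Word],
                   nths: Dict[str, Set[str]],
                   terminals: Set[str],
                   n: str) -> List[int]:
    """Descriptors for the Nh group: visit only the first symbol of each RHS."""
    found, d = [], 0
    while d < len(protab):
        word = protab[d]
        if word.symbol in terminals and word.lhs in nths[n]:
            found.append(d)
        d += word.to_next_rhs           # jump straight to the next RHS
    return found


def next_descriptor(protab: List[Word], d: int) -> int:
    """(pi, n + 1): skip the extra word that holds a semantic condition."""
    return d + 2 if protab[d].conditioned else d + 1
```

With PROTAB holding F → ( <#1> E, F → i, and T → F in that order,
the Th group receives the descriptors of ( and i, and the
successor of the descriptor of ( skips over the condition word to
reach E.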
6.5  Descriptor Set to FPL Production Mapping

The mapping of descriptors to FPL productions is very
simple. The descriptor is first increased by one if the symbol
pointed at is semantically conditioned. The stack comparison field
of the FPL production is thirty-two (32) words long and is filled,
right justified, directly from PROTAB using the descriptor. The
next word is the semantic routine to be called and is filled from
the appropriate field of the last word in the stack comparison.
The next word is minus one, one, or zero corresponding to ←, →,
or blank, respectively. The following word is the bar symbol to
be pushed, the nonterminal which the stack is to be reduced to,
or zero. The next word is one or zero for scan or no scan,
respectively. This is followed by the two part numerical label name.
These labels are formed similarly to the symbols in that there are
five classes or types. The entry fields are filled as follows:

     Type          Entry

     Ct(m)         m
     ch(m)         m
     Nh            N
     t(π, n)       (π, n)
     DNt           N

The last word of a numerical FPL production is the revised descriptor
from which it came. Much of the information in the production is
redundant. This format is used because it simplifies the
conversion into pseudo orders, a list of which is given in
Appendix A.
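
The word layout just described can be made concrete with a short
Python sketch. It is an illustration only: symbols and labels are
represented as plain integers (the two part symbol representation
is collapsed to one number), unused stack comparison words are
assumed to be zero, and the names encode, STACK_WORDS, and ARROW
are invented for the example.

```python
from typing import List, Sequence, Tuple

STACK_WORDS = 32                        # length of the stack comparison field
ARROW = {"<-": -1, "->": 1, "": 0}      # push bar symbol, reduce, or neither


def encode(stack_symbols: Sequence[int],
           semantic: int,
           arrow: str,
           target: int,
           scan: bool,
           label: Tuple[int, int],
           descriptor: int) -> List[int]:
    """Build one numerical FPL production as a flat list of 39 words."""
    if len(stack_symbols) > STACK_WORDS:
        raise ValueError("stack comparison string is too long")
    # Right justify the comparison string; zero padding is an assumption.
    words = [0] * (STACK_WORDS - len(stack_symbols)) + list(stack_symbols)
    words.append(semantic)              # semantic routine number, 0 if none
    words.append(ARROW[arrow])          # -1, 1, or 0
    words.append(target)                # bar symbol, nonterminal, or 0
    words.append(1 if scan else 0)      # scan or no scan
    words.extend(label)                 # two part label: (type, entry)
    words.append(descriptor)            # revised descriptor
    return words
```

Under these assumptions a production occupies 39 words: 32 for the
stack comparison, one each for the semantic routine, the arrow,
the target symbol, and the scan flag, two for the label, and one
for the descriptor.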

6.6  Preclusion  Elimination 

The  splitting  and  ordering  of  the  Nt  group  is  more  easily 
described  if  the  elimination  of  preclusions  is  explained  first. 
Assume  for  this  explanation  that  no  ordering  or  splitting  is 
necessary. 

All of the numerical FPL productions for a particular
group are generated and placed, one per line, into a table for
processing. The first production is compared, one at a time, with
the productions which follow it. If the stack comparison string
for the two productions is identical, then a preclusion exists and
lookahead generation is necessary. The two productions are
associated with two identical buffers. Each line of these buffers
has two words of control information plus n additional words,
where n is the maximum number of symbols of lookahead.
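
The pairwise test for preclusion amounts to comparing stack
comparison strings for equality. The following Python sketch shows
only that detection step; the function name find_preclusions and
the representation of each comparison string as a list are
assumptions made for the example, and the lookahead generation
that follows a detected preclusion is not modelled.

```python
from typing import List, Sequence, Tuple


def find_preclusions(stack_strings: Sequence[Sequence]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose comparison strings are identical."""
    pairs = []
    for i in range(len(stack_strings)):
        for j in range(i + 1, len(stack_strings)):
            if list(stack_strings[i]) == list(stack_strings[j]):
                pairs.append((i, j))    # production i precludes production j
    return pairs
```

For the <A>h group of section 5.1, where all three productions
compare the single symbol a, every pair is reported. For the Ah
group of section 5.2, where each use of a carries a different
semantic condition, no pair is reported.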

The two productions are expanded into the first lines of
the two buffers as follows:

     word one:  the entry part of the left side of the
                BNF production which formed this FPL
                production.

     word two:  zero

     the rest of the line is filled, left justified,
     by the symbols of this RHS which are not in
     the stack comparison field (as many as will
     fit).

Figure 1 is the flow chart for the generation of lookahead,
with the variables used defined as follows:

     I, J, K   are loop indices
     LK1       is the buffer associated with the
               first FPL production
     LK2       is the buffer associated with the
               second FPL production
     LL        is the maximum lookahead length
     LK1PT     is a pointer to the first empty line
               of LK1 (initially equal two)
     LK2PT     is a pointer to the first empty line
               of LK2 (initially equal two)

Figure 1.  Flow Chart for Lookahead Generation

The procedure GETSYM extends the string in line I (J) of
buffer LK1 (LK2) with pointer LK1PT (LK2PT) by finding all PROTAB
occurrences of nonterminal symbols which are not the last symbols
of a RHS, and which are marked in array NTTS as having the terminal
symbol of LK1[I, 1] as a tail symbol. Each such occurrence is used
to create a new line in LK1 as follows:

     word one:  the left side symbol of the occurrence;

     words three through K-1:  identical to row I;

     words K through LL + 2:  the symbols which follow
          the occurrence in that RHS (left justified).

The procedure NT2T expands the nonterminal symbol at LK1[I, K]
(LK2[J, K]) by replacing it with each RHS which has it as a left
side. Each such RHS creates a new line as follows:

     words one through K-1:  identical to row I;

     starting at word K:  the symbols of the RHS
          (symbols which would fill the line beyond
          LL + 2 are ignored);

     after these words, through word LL + 2 (if free words
          are available):  the words of line I starting at
          word K + 1.
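
The line construction performed by NT2T can be sketched in Python
as follows. This is a reading of the three rules above rather than
the implemented procedure: a buffer line is modelled as a list of
LL + 2 entries with None marking an unused word, K is the 1-based
word index used in the text, and the function name nt2t and its
parameters are assumptions. The comparison of new lines against
previously processed lines is not included.

```python
from typing import List, Optional, Sequence

Line = List[Optional[object]]


def nt2t(line: Line, k: int, rhs_list: Sequence[Sequence[str]], ll: int) -> List[Line]:
    """Expand the nonterminal at word k (1-based) once for each RHS given."""
    width = ll + 2                      # two control words plus LL symbols
    new_lines = []
    for rhs in rhs_list:
        # Words 1..K-1 unchanged, then the RHS, then the old words from K+1 on.
        candidate = line[:k - 1] + list(rhs) + line[k:]
        new_lines.append(candidate[:width])   # drop symbols beyond word LL + 2
    return new_lines
```

With LL = 3 and a line whose symbol words hold T followed by +,
expanding T by the right-hand sides T * F and F gives one line
holding T * F (the + no longer fits) and one holding F +.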
For both GETSYM and NT2T, line I is saved at the end of
the buffer. Each new line created is compared with all lines
previously processed (lines one through I-1 and the lines at the end
of the buffer). If the new line (except word two) is identical to
any of these it is ignored. The first new line not ignored is put
at line I and the rest are placed starting at LK1PT (which is
incremented by one for each such line). If all new lines created
by GETSYM or NT2T while processing line I of LK1 are ignored,
then LK1PT is decremented by one, line LK1PT is moved to line I,
and control is transferred to point C in the flow chart. If this
occurs while processing line J of LK2, then LK2PT is decremented
by one, line LK2PT is moved to line J, and control is transferred
to D.

Exit A is taken when sufficient lookahead has been
generated for the first FPL production to prevent it from
precluding the second. Comparison with the rest of the FPL
productions is then continued. If further preclusions occur the
precluded production is expanded into LK2, LK2PT is set equal to
two, and the procedure outlined by Figure 1 is repeated. The
lookahead already generated may prove sufficient, or an extension
of it may be generated. If Exit A is taken and no further
preclusion exists, then lookahead generation is complete for this
FPL production. Each line of LK1 represents a different lookahead
string and word two indicates how many of the symbols are
meaningful. The entire process is then repeated for the next FPL
production of the group.

Exit B is taken when the two FPL productions cannot be
differentiated by lookahead. In this case the second is combined
with the first to form a new production which replaces the first
in the table. The second production is also removed. The
combined production then is compared to the following productions
just as any other production would be. The only difference is that
an expansion of it into LK1 creates one line for each of the
original productions which were combined to create it. For this
reason, the descriptors of those productions must be retained until
the processing of the new production is complete.

If the group being created is not an Nta or Ntb(π, n)
group, then the production is converted into pseudo orders as soon
as this processing is complete.

6.7  Nta, Ntb(π, n), CNtb(m) Groups

The building of CNtb(m) groups is similar to that described
in section 6.6 but has the following differences. The table of FPL
productions must be semi-ordered with the ELR productions first.
The ELR productions preclude the non-ELR productions even though
the stack comparison strings are not identical. ELR and non-ELR
productions may not be combined. Lastly, pseudo orders for the bar
symbol removal production must be generated after processing of the
ELR productions and before processing of the non-ELR productions.

The Nta and Ntb(π, n) groups are created by first building
an FPL production table which contains only the ELR productions.
This table is then processed exactly as described in section 6.6.
Upon completion of the processing of one of these productions, the
production and all non-empty lines of LK1 are saved in a separate
buffer.

An FPL production table which contains the non-ELR class A
productions is then built. The ELR productions, and the partial
lookaheads associated with them, are then compared with each of
these productions just as if the class A productions had followed
them in the FPL production table originally. Combination of ELR
and non-ELR productions is allowed in this case. When each of the
ELR productions has been processed it is converted to pseudo orders.
The remaining class A productions are then processed and converted
to pseudo orders in the normal manner. This completes the
construction of the Nta group.

To build each of the Ntb(π, n) groups an FPL production
table is built which contains the single appropriate class B non-ELR
production. The procedure then followed is the same as for the Nta
group with the following exceptions. The ELR productions preclude
the non-ELR production even though the stack comparison strings are
not identical. Combination of ELR and non-ELR productions is not
allowed. Pseudo orders for the bar symbol removal production must
be generated before they are generated for the non-ELR production.

A list of the pseudo orders that may be generated is given
in Appendix A. Appendix B gives a complete listing of the output
from the B5500 syntax preprocessor program SYNPROF/TWS for a BNF
specification of a grammar for the language DEMALGL, a subset of
ALGOL.

7.  OPTIMIZATION AND ERROR RECOVERY

There  are  a  number  of  ways  in  which  both  the  algorithm 
and  its  implementation  may  be  optimized.   Some  of  these  have  already 
been  described  in  chapter  six.   That  is  to  say,  the  minor  changes 
of  the  algorithm  necessary  for  its  implementation  may  be  considered 
optimizations. 

7.1  Stack Comparisons

The first optimization to be described is the one which
most affects the speed and efficiency of a translator based on the
FPL productions generated by this algorithm. It is recognition of
the fact that immediately after a scan, no stack comparison need
be done except of the top symbol of the stack. To see this, it is
necessary to consider the situation upon entry to each of the six
types of groups.

The BNF production which causes generation of a t(π, n)
group is of the form N → αt .... The FPL production which calls
for a transfer to this t(π, n) group is

     α |   | * t(π, n)

and the single production in the group is

     αt

The test for a preceding bar symbol and a stack comparison for the
string α are obviously unnecessary because if these tests were to
fail, the transfer to t(π, n) would never have been made.

The Ct(m) group is similar, except that it arises from
a set of BNF productions:

     N1 → αt ...
     N2 → αt ...
       .
       .
     Nk → αt ...

The transferring FPL production is:

     α |   | * Ct(m)

The Ct(m) group contains the FPL productions

     αt
     αt
       .
     αt

or combinations thereof. Again, only a check of the top of the
stack for the symbol t is necessary.
All transfers to Nh or ch(m) groups are made from FPL
productions of the form:

     α |  ← Q | * L

where Q is N(π, n), (m), or N*(m) and L is Nh or ch(m).
Therefore,  stack  level  two  must  contain  a  bar  symbol  upon  entry 

to  an  Nh  or  ch  (m)  group.   Since  all  the  FPL  productions  in  these 
groups  are  of  the  form 


the  implicit  lookback  for  a  bar  symbol  is  unnecessary. 

All transfers to Nta groups are made from FPL productions
of the form:

     α |  → N |   DNt

After this reduction stack level two is some bar symbol and stack
level one is N. All FPL productions of an Nta group are of the form:

     N

Thus no stack comparison is necessary and lookahead alone determines
which FPL production applies. Using the implementation described
in chapter six, the last production of an Nta group will have no
lookahead and must apply. Therefore, no error production is
necessary after this group.

The configuration of the stack upon entry to an Ntb(π, n)
or CNtb(m) group is the same as upon entry to an Nta group. These
groups begin with the ELR productions of the form:

     N

which require no stack comparison. The dynamic transfer DNt is
defined in such a way that the bar symbol removal production must
apply if the transfer is to an Ntb(π, n) or CNtb(m) group, so no
stack comparison is necessary for it. This leaves only the non-ELR
production(s). If the bar symbol is N or N* or (m), where
N or N* ∈ (m), then it was pushed into the stack by

          α |  ← N or N* | * Nh

     or   α |  ← (m)     | * ch(m)

In either case, this came from a BNF production of the form
M → αN ... (and possibly others of the form M → αK ..., where
K ≠ N). The transfer DNt, in each case, is to Ntb(π, n) or CNtb(m),
which has as its non-ELR production(s):

     αN

which applies with no stack comparison. Again, only lookahead
differentiates between the productions and no error production is
necessary.

7.2  Variable  Lookahead  Length 

The  lookahead  generation  was  described  in  section  6.6  as 
being  done  once  for  a  given  maximum  lookahead  length.  A  vast 
majority  of  groups  for  any  reasonable  grammar  (based  on  actual  use 
of  a  translator  using  this  algorithm)  can  be  built  with  a  look- 
ahead  of  zero  or  one  symbols.  All  of  the  nontrivial  grammars  which 
have  been  input  to  the  TWS  have  had  a  few  groups  which  needed  up 
to three symbols of lookahead. A one symbol lookahead generation
which assumes a possible maximum of three symbols is very
inefficient, since extra (unnecessary) symbols must be retained at
each step.

To  overcome  this  inefficiency  the  algorithm  without  com- 
bination of  productions  is  first  applied  with  an  assumed  maximum 
lookahead  length  of  one.   The  names  of  groups  which  cannot  be  built 
in  this  first  pass  are  stored  in  a  table.   The  algorithm  with  com- 
bination of  productions  and  a  maximum  lookahead  length  specified  by 
the  language  designer  (up  to  fifteen,  with  default  of  three)  is  then 
applied.   This  two  pass  approach  has  proved,  in  practice,  to  signif- 
icantly increase  efficiency. 
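
The two pass approach can be illustrated with a minimal Python sketch.
It is not the TWS implementation: the builder function, its signature,
and the group names in the toy data are assumptions made for the
illustration. The builder is assumed to return None when a group
cannot be built within the given lookahead length.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical builder: returns the finished group, or None when the group
# cannot be built with at most `max_k` lookahead symbols.
Builder = Callable[[str, int, bool], Optional[str]]


def two_pass_build(names: List[str], build_group: Builder,
                   designer_max: int = 3) -> Dict[str, Optional[str]]:
    """First pass: no combination of productions, lookahead of at most one.
    Second pass: only the groups that failed, with combination of
    productions and the designer's maximum lookahead length."""
    if designer_max > 15:
        raise ValueError("maximum lookahead length is limited to fifteen")
    built: Dict[str, Optional[str]] = {}
    failed: List[str] = []          # table of groups not built in pass one
    for name in names:
        group = build_group(name, 1, False)
        if group is None:
            failed.append(name)
        else:
            built[name] = group
    for name in failed:
        built[name] = build_group(name, designer_max, True)
    return built


# Toy stand-in (assumed data): the lookahead each made-up group needs.
NEEDED = {"Sh": 0, "Nta": 1, "Ntb": 3, "Mh": 1}
calls: List[Tuple[str, int, bool]] = []


def toy_builder(name: str, max_k: int, combine: bool) -> Optional[str]:
    calls.append((name, max_k, combine))
    return f"{name}: k={NEEDED[name]}" if NEEDED[name] <= max_k else None


groups = two_pass_build(list(NEEDED), toy_builder)
second_pass = [c for c in calls if c[1] == 3]   # only Ntb is retried
```

In this toy run only one of the four groups reaches the expensive
second pass, which is the saving the two pass approach is meant to
give.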

7.3  Nonterminal  Symbol  Expansion

The flow chart for lookahead generation (Figure 1) indicates
a looping on the procedure NT2T. The purpose of this looping
is to get all expansions of a particular nonterminal which are n
(or fewer) symbols long and begin with a terminal symbol. In this
case n is the number of symbols which, beginning at word K, would
fill line I of LK1. Such expansion of a nonterminal symbol may occur
several times in the building of the various groups.

A  special  procedure  has  been  added  which  uses  the  princi- 
ple of  the  loop  through  NT2T  to  generate  the  desired  expansion  and 
save  the  results,  or  simply  returns  the  strings  directly  if  the 
expansion  has  been  done  previously. 

The use of this procedure replaces the two loops in
Figure 1. This also significantly improves efficiency.
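
The idea of saving an expansion the first time it is generated can be
sketched in Python as follows. This is a simplified illustration and
not the procedure used in the TWS: NT2T and LK1 are not modeled, the
expansion is carried all the way to terminal strings, and the names
terminal_prefixes and ExpansionTable as well as the small grammar are
assumptions made for the example.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Rhs = Tuple[str, ...]
Grammar = Dict[str, List[Rhs]]        # nonterminal -> its right-hand sides


def terminal_prefixes(grammar: Grammar, n: int) -> Dict[str, Set[Rhs]]:
    """For every nonterminal, the terminal strings of at most n symbols
    that can begin one of its expansions (a fixed-point computation)."""
    found: Dict[str, Set[Rhs]] = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                strings: Set[Rhs] = {()}
                for sym in rhs:
                    tails = found[sym] if sym in grammar else {(sym,)}
                    # Concatenate and truncate to n symbols.
                    strings = {(s + t)[:n] for s in strings for t in tails}
                if not strings <= found[nt]:
                    found[nt] |= strings
                    changed = True
    return found


class ExpansionTable:
    """Generates an expansion once and returns the saved strings after."""

    def __init__(self, grammar: Grammar) -> None:
        self.grammar = grammar
        self.saved: Dict[Tuple[str, int], FrozenSet[Rhs]] = {}
        self.computed = 0             # number of expansions actually generated

    def expand(self, nonterminal: str, n: int) -> FrozenSet[Rhs]:
        key = (nonterminal, n)
        if key not in self.saved:
            self.computed += 1
            prefixes = terminal_prefixes(self.grammar, n)[nonterminal]
            self.saved[key] = frozenset(prefixes)
        return self.saved[key]


# Assumed example grammar: E -> T + E | T,  T -> id | ( E )
GRAMMAR: Grammar = {"E": [("T", "+", "E"), ("T",)],
                    "T": [("id",), ("(", "E", ")")]}
table = ExpansionTable(GRAMMAR)
first = table.expand("E", 2)
again = table.expand("E", 2)          # answered from the saved results
```

The second request for the same nonterminal and length does no work
beyond a table lookup, which corresponds to returning the strings
directly when the expansion has been done previously.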

7.4  Pseudo  Orders

The  generation  of  pseudo  orders  will  not  be  discussed  in 
any  detail  since  it  is  relatively  obvious  what  orders  are  necessary. 
Three  things,  however,  should  be  noted. 

First,  two  different  tests  for  terminal  symbols  in  the 
stack  or  lookahead  must  exist.   One,  for  identifiers,  numbers,  and 
strings,  tests  only  the  type,  and  another  tests  both  type  and  entry. 

It was shown in section 7.1 that the entire stack comparison
is unnecessary. It is, however, necessary, for productions of
the form:

α |  →  N |    DNt

to retain the length of the string α, because that is the number of
symbols which must be replaced by N at the top of the stack.

A  relatively  complex  set  of  pseudo  orders  combined  with 
proper  ordering  of  the  lookahead  strings  for  a  given  FPL  production 
can  reduce  the  number  of  symbols  which  must  actually  be  compared. 
A simple example is the following. Assume the lookahead strings
for a production are:

a b c
e f g
a b d

If these are reordered to

a b c
a b d
e f g

it is easily seen that two separate tests for ab are unnecessary.

7.5  Error  Recovery

Section 7.1 shows that the bar symbols are not necessary
for lookback. They continue, however, to serve two very useful
functions. The first, which has already been discussed, is to act
as the parameter for the dynamic transfer, DNt. Their second, and
equally important, function is their use in error recovery. Error
recovery is not a necessary part of the algorithm, but a translator
based on the algorithm must have some form of error recovery if it
is to be practically useful.

One  of  the  tables  output  by  the  TWS  for  the  skeleton 
program  is  a  set  of  lists  of  all  terminal  symbols  which  may  follow 
each  nonterminal  symbol.   These  lists  are  generated  by  properly 
initializing LK1 and then initiating a one symbol lookahead
generation.

A syntactic error is found in the input string by execution
of the error production at the end of some group. When this
occurs, the stack is searched for the uppermost bar symbol. If
this symbol is N or N*, the input stream is scanned until one of
the terminal symbols t which may follow N is found. The symbols
between the N or N* in the stack and the t in the input stream
are deleted. The nonterminal symbol N is then pushed into the
stack and the dynamic transfer DNt is made.

If the uppermost bar symbol is (m), then the input stream
is scanned for a terminal symbol t which may follow any of the
nonterminal symbols M which are such that M ∈ (m) or M* ∈ (m). If t
may follow only one of these, the procedure described above is
applied. If it may follow several, then one is chosen arbitrarily
and the procedure is applied.
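
The recovery procedure just described can be sketched in Python as
follows. The data representation is an assumption made for the
illustration: a bar symbol is modeled as a tuple of the nonterminal
names it stands for (one name for N or N*, several for (m)), the
distinction between N and N* is ignored, the "follow" table stands for
the lists of terminal symbols output by the TWS, and the dynamic
transfer DNt itself is not modeled. Where several nonterminals qualify,
the first one listed is taken as the arbitrary choice.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple, Union

Bar = Tuple[str, ...]               # bar symbol: the nonterminal(s) it names
StackItem = Union[str, Bar]


def recover(stack: List[StackItem], tokens: Sequence[str], pos: int,
            follow: Dict[str, Set[str]]
            ) -> Optional[Tuple[List[StackItem], int, str]]:
    """Return (new stack, index of t in the input, chosen nonterminal),
    or None when the rest of the input is skipped without finding t."""
    # Search the stack, from the top down, for the uppermost bar symbol.
    bar_at = next((i for i in range(len(stack) - 1, -1, -1)
                   if isinstance(stack[i], tuple)), None)
    if bar_at is None:
        return None
    candidates = stack[bar_at]
    # Scan the input for a terminal t which may follow a candidate.
    for i in range(pos, len(tokens)):
        for nonterminal in candidates:
            if tokens[i] in follow[nonterminal]:
                # Delete the symbols above the bar symbol, push N, and
                # resume with t as the next input symbol.
                return stack[:bar_at + 1] + [nonterminal], i, nonterminal
    return None


# Assumed example: an error inside a statement of a small block language.
FOLLOW = {"stmt": {";", "end"}}
stack: List[StackItem] = ["begin", ("stmt",), "x", ":=", "("]
tokens = ["+", "3", ";", "y", ":=", "1", "end"]
result = recover(stack, tokens, 0, FOLLOW)
nothing = recover(["a", ("S",)], ["x", "y"], 0, {"S": set()})
```

In the example the symbols x, := and ( are deleted from the stack, the
input symbols + and 3 are skipped, and parsing would resume with stmt
on the stack and ; as the next input symbol. The second call shows the
case in which no suitable terminal symbol is found and the remaining
input is consumed.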

This  error  recovery  can  be  extremely  effective  or  extremely 
ineffective depending on the state of the parse when an error is
encountered. The worst case occurs when the uppermost bar symbol
is S. If S is not recursive this causes the entire input string to
be deleted. Experience indicates, however, that the error recovery
just described is, in general, about as effective as that of many
hand coded compilers, particularly for languages, such as ALGOL,
which are specified recursively.

8.   CONCLUSION

A  TWS  based  on  the  algorithm  described  in  the  preceding 
chapters  has  been  fully  implemented.   In  general,  it  satisfies  the 
goals  set  for  it  in  chapter  1.  This  chapter  briefly  compares  it  to 
other  parsing  algorithms  and  discusses  the  subclass  of  grammars 
which  it  will  accept.  There  are  also  brief  discussions  of  problems 
yet  to  be  solved,  of  potential  improvements,  and  of  areas  of  further 
research. 

8.1  Precedence  Systems 

The parsing algorithm can be compared to the precedence
systems of McKeeman [1], and Wirth and Weber [2] in the sense that
one of three stack actions is specified by each FPL production. The
first of these is to do nothing, which is equivalent to a precedence
relation of =. The second is to push a bar symbol into the stack.
This is done when a precedence relation of <• exists and occurs at
the top of the stack, rather than down in the stack as in the
techniques of McKeeman, and Wirth and Weber, but effectively serves
the same function. The third stack action is to make a reduction,
which is equivalent to a precedence relation of •> at the top of the
stack.

One advantage of the conversion algorithm (and, hence, of
the FPL recognizers produced) is that, with one exception, it is able
to make use of more context than is employed in precedence relations
in determining the bounds of a substring to be reduced.

In determining the existence of a <• relation both precedence
schemes consider one symbol below it in the stack. McKeeman
makes use of two symbols above it and Wirth and Weber one symbol.
The conversion algorithm uses a minimum of one symbol below it in
the stack plus further information, which may be implicit in the
grouping, and a lookahead (corresponding to symbols above it in the
stack) of n symbols, where n is usually greater than two, to decide
when it is necessary to push a bar symbol into the stack.

In determining the existence of a •> relation at the top
of the stack, both precedence schemes use a one symbol lookahead
while the conversion algorithm uses n (n > 1) symbols. Wirth and
Weber's system looks one symbol into the stack to determine this
relation, while McKeeman's system looks at two symbols. One symbol
is the minimum used explicitly by the conversion algorithm and the
information available in the second symbol is, except in a Nta group,
implicit in the grouping.

Another advantage of the generated FPL recognizers over
the precedence schemes is that the latter do not allow two BNF
productions with identical RHS's. This is particularly important if
several people are involved in the design of a language, or if the
BNF grammar is to be used descriptively, as in the ALGOL report.

8.2  Right  Bounded  Context

Let

α = x_1 ... x_m ... x_t,    x_i ∈ V_N ∪ V_T.

Define (n, p) to be a handle of α if (for k > 0) N → x . . . x is
the p-th production and x . . . x is the leftmost substring of α
which appears as a RHS of a production and a reduction to S could
begin with a reduction of this substring to N. Floyd [12] defines
a grammar as right bounded context (m, n), denoted LR (m, n), if any
handle is always uniquely determined by the m stack symbols to its
left and the n input symbols to its right.

The conversion algorithm clearly accepts all grammars which
are LR (0, n), where n is the maximum lookahead length. It also
accepts some grammars beyond this class. The following grammar

π

1    S → a c N t
2    S → b c M
3    N → d
4    M → d t

is LR (2, 0), since in the strings acdt and bcdt, a decision as to
whether the substring d is a handle (and should be reduced to N) can
be made only by looking two symbols to its left for an a or b.

Sh:    a |     |    *c (1, 2)
       b |     |    *c (2, 2)
effectively separates the two cases. This example can be extended
to LR (m, 0) by replacing the c in the first two BNF productions with
identical terminal symbol strings m - 1 symbols long.
The grammar

π

1    S → a N c
2    S → b M
3    N → K
4    M → K c
5    K → d


which is LR (1, 0), cannot be converted to a (deterministic) FPL
recognizer by the algorithm because of the preclusion in the Kta
group

Kta:    K |  → N |    DNt

        K |      |    *c (4, 2)

which can only be resolved by a one symbol lookback.
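
The amount of left context needed in the two example grammars of this
section can be checked with a small Python sketch. It compares only the
pairs of critical strings named in the text, not the full right bounded
context condition over all sentential forms, and the function names are
assumptions made for the illustration.

```python
from typing import Sequence, Tuple

Context = Tuple[Tuple[str, ...], Tuple[str, ...]]


def context(form: Sequence[str], start: int, end: int,
            m: int, n: int) -> Context:
    """The m symbols left of form[start:end] and the n symbols right of it."""
    left = tuple(form[max(0, start - m):start])
    right = tuple(form[end:end + n])
    return left, right


def separable(reduce_form: Sequence[str], other_form: Sequence[str],
              start: int, end: int, m: int, n: int) -> bool:
    """True when m left symbols and n right symbols tell the cases apart."""
    return (context(reduce_form, start, end, m, n)
            != context(other_form, start, end, m, n))


# First grammar: d is a handle in a c d t (N -> d) but not in b c d t,
# where the handle is d t (M -> d t).
first = (list("acdt"), list("bcdt"), 2, 3)

# Second grammar: K is a handle in a K c (N -> K) but not in b K c,
# where the handle is K c (M -> K c).
second = (["a", "K", "c"], ["b", "K", "c"], 1, 2)

first_needs_two = (not separable(*first, 1, 3)) and separable(*first, 2, 0)
second_needs_one = (not separable(*second, 0, 1)) and separable(*second, 1, 0)
```

In the first grammar one symbol of left context is not enough whatever
the lookahead, while two symbols suffice. In the second grammar the
lookahead c is the same in both cases, and only the symbol below K
distinguishes them.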

In general, the subclass of Chomsky type two grammars
accepted  by  the  BNF  to  FPL  conversion  algorithm  does  not  fit  any 
known  classification  scheme. 

8.3  Problems

The  present  implementation  of  the  conversion  algorithm 
described  in  chapters  2  through  5  has  two  drawbacks.   The  first  is 
that  its  execution  consumes  relatively  large  amounts  of  computer 
time.  The  second,  and  probably  more  serious,  is  that  when  a  large 
grammar  is  not  accepted  by  the  algorithm,  it  is  often  difficult  to 
see  what  changes  should  be  made  in  the  grammar  which  will  make  it 
acceptable  without  altering  the  language  being  defined. 

These  two  problems  tend  to  emphasize  each  other  in  the 
sense  that  the  time  factor  makes  it  impractical  to  make  several  runs 
with  tentatively  changed  grammars  in  an  attempt  to  find  an  accept- 
able one.   For  this  reason  it  is  important  that  the  language  designer 
be  relatively  certain  that  his  grammatical  changes  are  sufficient 
on  the  first  attempt  at  correction.   This,  unfortunately,  is  pre- 
cisely what  it  is  difficult  to  do. 

8.4  Potential  Improvements

One improvement which is being implemented is a rewriting
of the program with more efficient coding in an attempt to
significantly decrease the execution time. There is reason to believe
that these efforts will meet with at least some success.

In the near future efforts will be made to give the language
designer  sufficient  diagnostic  information  to  enable  him  more  easily 
to  make  the  grammatical  changes  necessary  for  acceptance  by  the  algo- 
rithm.  How  well  this  can  be  done  is  unknown  since  it  is  not  yet  clear 
exactly  what  information  is  needed. 

In  general,  the  speed  of  compilers  built  by  a  TWS  based  on 
this  algorithm  will  depend  to  a  great  degree  on  the  time  taken  by  the 
semantic  routines.   It  is  important,  however,  to  make  the  parsing  as 
fast  as  possible. 

Three things are presently being done in an attempt to increase
the parsing speed. As can be seen in the appendices, the stream
of pseudo orders is similar to a sequence of procedure calls. A
program is being written which will convert the pseudo order stream
into a block of Burroughs B5500 ALGOL code which, when compiled,
should execute faster than the interpretive method now in use.

A program to convert the initial BNF grammar into one with
fewer productions and fewer nonterminal symbols is also being written.
This will be achieved by back-substitution for a nonterminal symbol
which appears as the left side of only one BNF production. For
example:

A → α B β

B → γ

becomes

A → α γ β

This would speed up the parse by cutting down the number of
reductions, and associated stack manipulations, necessary at parse
time.
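
A minimal Python sketch of such a back-substitution is given below. It
is an illustration under stated assumptions, not the program being
written: the start symbol and any directly recursive nonterminal are
left alone (an assumption of this sketch), and semantic routines
attached to a removed production are not considered.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Production = Tuple[str, Tuple[str, ...]]      # (left side, right side)


def back_substitute(productions: Sequence[Production],
                    start: str) -> List[Production]:
    """Repeatedly remove a nonterminal that is the left side of exactly
    one production by substituting its right side wherever it occurs."""
    prods: List[Production] = [(lhs, tuple(rhs)) for lhs, rhs in productions]
    while True:
        counts = Counter(lhs for lhs, _ in prods)
        target = next((lhs for lhs, rhs in prods
                       if counts[lhs] == 1 and lhs != start
                       and lhs not in rhs), None)
        if target is None:
            return prods
        body = next(rhs for lhs, rhs in prods if lhs == target)
        rewritten: List[Production] = []
        for lhs, rhs in prods:
            if lhs == target:
                continue                      # the production disappears
            new_rhs: List[str] = []
            for sym in rhs:
                new_rhs.extend(body if sym == target else (sym,))
            rewritten.append((lhs, tuple(new_rhs)))
        prods = rewritten


# The example from the text, with a for α, b for β and g for γ, plus a
# second (assumed) alternative for A.
grammar = [("A", ("a", "B", "b")), ("A", ("c",)), ("B", ("g",))]
reduced = back_substitute(grammar, start="A")
```

The nonterminal B and its production vanish, so the reduction of g to B
and the associated stack manipulation are no longer needed at parse
time.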

A  third  improvement  is  to  keep  the  bar  symbols  in  a  separate 
stack  and  thus  eliminate  the  stack  manipulation  necessary  for  bar 
symbol removal. This is possible since, as shown in chapter 7, the
bar symbols are not actually used for lookback.
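
The effect of a separate bar stack can be sketched in Python as
follows. Both classes are assumptions made for the illustration and
greatly simplify the parse stack; the comments relate the second class
to the pseudo orders BPSH and BPOP of Appendix A.

```python
from typing import List, Tuple, Union


class CombinedStack:
    """Bar symbols share the parse stack with ordinary symbols."""

    def __init__(self) -> None:
        self.items: List[Union[str, Tuple[str, str]]] = []
        self.moves = 0                     # symbols moved during bar removal

    def push(self, symbol: str) -> None:
        self.items.append(symbol)

    def push_bar(self, bar: str) -> None:
        self.items.append(("bar", bar))

    def remove_bar(self) -> str:
        # The bar symbol is at stack level two, below N at level one,
        # so N has to be moved down one position.
        top = self.items.pop()
        _, bar = self.items.pop()
        self.items.append(top)
        self.moves += 1
        return bar


class SeparateStacks:
    """Bar symbols are kept in a stack of their own."""

    def __init__(self) -> None:
        self.symbols: List[str] = []
        self.bars: List[str] = []
        self.moves = 0

    def push(self, symbol: str) -> None:
        self.symbols.append(symbol)

    def push_bar(self, bar: str) -> None:  # corresponds to BPSH
        self.bars.append(bar)

    def remove_bar(self) -> str:           # corresponds to BPOP
        return self.bars.pop()


def demo(stack) -> Tuple[str, int]:
    stack.push("a")
    stack.push_bar("N")
    stack.push("N")                        # state after a reduction to N
    return stack.remove_bar(), stack.moves


combined = demo(CombinedStack())
separate = demo(SeparateStacks())
```

With the combined stack the removal of the bar symbol moves N, while
with separate stacks no ordinary symbol is touched.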

Another potential improvement is the definition of a richer
meta-syntactic language. As presently envisioned this would be
something similar to the regular expression of a Chomsky type three
grammar  as  the  RHS  of  productions.   This  would  then  be  mechanically 
converted  to  BNF  productions  by  a  prepass.   It  is  believed  that  this 
would  make  writing  the  syntax  easier  for  the  language  designer. 

The  final  improvement  presently  being  considered  is  a  rela- 
tively complex  combination  of  ideas  to  improve  the  automatic  error 
recovery  of  compilers  built  by  a  TWS  based  on  this  parsing  algorithm. 

8.5  Research  Extensions

Three  interesting  areas  of  research  appear  worthy  of  pursuit. 
The  first  is  a  new  method  of  subclassification  of  Chomsky  type  two 
grammars  which  would  clarify  the  properties  which  make  a  grammar  un- 
acceptable to  this  conversion  algorithm. 

The  second  is  the  investigation  of  the  similarity  of  the 
FPL  groups  to  the  states  of  a  finite  state  machine.   This  area  is 
presently  being  investigated  by  F.  L.  DeRemer  at  MIT. 

The  final  area  is  the  application  of  the  principles  of  this 
algorithm  to  Chomsky  type  one  and  type  zero  grammars. 




APPENDIX  A
THE  PSEUDO  ORDERS

NOOP             dummy placeholder for array row ends

updating pointers

LLVL             initialize lookahead test pointer
ILVL             increment stack pointer

stack symbol tests

XSBT (T, A)      test top stack symbol type-field for
                 value T, false => branch to A
XSBE (E, A)      test top stack symbol entry-field for
                 value E, false => branch to A
XSIT (T)         test top stack symbol type-field for
                 value T, false => print error message
                 and insert correct symbol
XSIE (E)         test top stack symbol entry-field for
                 value E, false => print error message
                 and insert correct symbol

lookahead symbol tests

XLAT (T, A)      test lookahead symbol type-field for
                 value T, true => branch to A
XLAE (E, A)      test lookahead symbol entry-field for
                 value E, true => branch to A
XLBT (T, A)      test lookahead symbol type-field for
                 value T, false => branch to A
XLBE (E, A)      test lookahead symbol entry-field for
                 value E, false => branch to A
XLLT (T)         test lookahead symbol type-field for
                 value T, false => branch to next
                 production
XLLE (E)         test lookahead symbol entry-field for
                 value E, false => branch to next
                 production

stack manipulation

TPSH             push next terminal symbol to stack
NPOP (N)         pop N symbols from stack
RED1 (S)         reduce top stack symbol to symbol S
REDN (N, S)      pop N symbols from stack, then reduce
                 top stack symbol to symbol S

bar stack manipulation

BPSH (B)         push bar symbol B into bar stack
BPOP             pop a bar symbol from bar stack

program control

SETS (A)         store A as address of next production
GOTO (A)         transfer to address A
XBGO (E, A)      test bar symbol entry-field for value E
                 true => branch to address contained in
                 bar symbol
                 false => branch to A
SKIP (N)         skip next N characters of instruction

semantic calls

EXSM (E)         execute EXEC (E)
XTSM (E)         execute EXEC (E), test SEMANTICTEST
                 false => branch to next production

error

XSLR (E)         test for lookahead symbol valid after
                 nonterminal E, false => execute ERRR
ERRT (E, A)      insert terminal symbol E and go to A
ERRN (S, A)      change stack to S (reduce or push bar)
                 by going back to address A
ERRR             skip to first symbol that can follow bar
                 symbol and put appropriate nonterminal
                 in stack and branch to address in bar
                 symbol
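
The four stack manipulation orders can be illustrated with a small
Python interpreter. This is a sketch based only on the one line
descriptions in the table above: the encoding of an order as a tuple,
the function name, and the treatment of the input as a list of symbols
are assumptions, and none of the other pseudo orders are modeled.

```python
from typing import List, Sequence, Tuple


def run_orders(orders: Sequence[Tuple], tokens: Sequence[str]) -> List[str]:
    """Interpret TPSH, NPOP, RED1 and REDN on a simple symbol stack."""
    stack: List[str] = []
    pos = 0                                   # next input symbol
    for name, *args in orders:
        if name == "TPSH":                    # push next terminal symbol
            stack.append(tokens[pos])
            pos += 1
        elif name == "NPOP":                  # pop N symbols
            (n,) = args
            if n > len(stack):
                raise ValueError("stack underflow")
            del stack[len(stack) - n:]
        elif name == "RED1":                  # reduce top symbol to S
            (s,) = args
            stack[-1] = s
        elif name == "REDN":                  # pop N symbols, then reduce
            n, s = args
            if n >= len(stack):
                raise ValueError("stack underflow")
            del stack[len(stack) - n:]
            stack[-1] = s
        else:
            raise ValueError(f"order not modeled in this sketch: {name}")
    return stack


# A string of three symbols is replaced by N: two symbols are popped
# and the remaining top symbol is reduced to N.
orders = [("TPSH",), ("TPSH",), ("TPSH",), ("REDN", 2, "N")]
result = run_orders(orders, ["a", "b", "c"])
single = run_orders([("TPSH",), ("RED1", "K")], ["d"])
```

Under this reading of REDN, replacing a string of three symbols by N
uses a pop count of two, which is why the length of the reduced string
must be retained for productions that make a reduction.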